Methods of differentiating metastatic and non-metastatic tumors

ABSTRACT

Methods of screening for a tumor or tumor progression to the metastatic state are disclosed. The screening methods are based on the characterization of DNA by principal components analysis of spectral data yielded by Fourier transform-infrared spectroscopy of DNA samples. The methods are applicable to a wide variety of DNA samples and cancer types. A model developed using multivariate normal distribution equations and discriminant analysis is particularly well suited for distinguishing primary cancerous tissue from metastatic cancerous tissue.

TECHNICAL FIELD

[0001] The present invention is generally directed toward tumor identification, including tumor detection and characterization. The invention is more particularly related to characterizing DNA based upon principal components analysis of spectral data yielded by Fourier transform-infrared spectroscopy of DNA samples, in order to screen for a tumor or progression of a tumor to the metastatic state.

BACKGROUND OF THE INVENTION

[0002] Despite enormous expenditures of both financial and human resources over the last twenty-five plus years, the detection of new tumors or the recurrence of tumors remains an unfulfilled goal of humankind. Particularly frustrating is the fact that a number of cancers are treatable if detected at an early stage, but go undetected in many patients for lack of a reliable screening procedure. In addition, the need is acute for reliable screening procedures that discriminate non-metastatic primary tumors (or non-cancerous disease states) from metastatic tumors, or are predictive of progression to the metastatic state. Metastasis of tumors is a major cause of treatment failure in cancer patients. It is a complex process involving the detachment of cells from the primary neoplasm, their entrance into the circulation, and the eventual colonization of local and distant tissue sites.

[0003] Frequently, physicians must err on the side of caution, and request that a patient undergo surgical or other procedures that dramatically affect the patient's quality of life, without identification of the disease state as a tumor with a propensity to progress to the metastatic state. For illustrative purposes, two particular cancers, prostate and breast cancers, are described in more detail and are representative of cancers in need of new approaches, which the invention disclosed herein provides.

[0004] Prostate cancer is a leading cause of death in men. Thus, there is a keen interest in the etiology of this disease, as well as in the development of techniques for predicting its occurrence at early stages of oncogenesis. Little is known about the etiology of prostate cancer, the most prevalent form being adenocarcinoma. However, several studies have focused on inactivation of the tumor suppressor gene TP53 and altered DNA methylation patterns as possible factors. In addition, free radicals, arising from redox cycling of hormones, have recently been implicated in prostate cancer. This is consistent with evidence showing that the hydroxyl radical (.OH) produces mutagenic alterations in DNA, such as 8-hydroxyguanine (8-OH-Gua) and 8-hydroxyadenine (8-OH-Ade), that have been linked to carcinogenesis in a variety of studies. Despite these findings, virtually no understanding exists of the possible relationship between the .OH-modification of DNA and prostate cancer.

[0005] Prostate tissue may contain areas of benign prostatic hyperplasia (BPH), which is not regarded as a pre-malignant lesion, although it often accompanies prostate cancer. The etiology of BPH is unknown, as is its relationship to prostate cancer. Due to the difficulties in the current approaches to the diagnosis of prostate cancer, there is a need in the art for improved methods. The present invention fulfills this need, and further provides other related advantages.

[0006] Breast cancer is a leading cause of death in women and is the most common malignancy in women. The incidence of breast cancer is on the rise. One in nine women will be diagnosed with the disease. Standard approaches to treat breast cancer have centered around a combination of surgery, radiation and chemotherapy. In certain malignancies, these approaches have been successful and have effected a cure. However, when diagnosis is beyond a certain stage, breast cancer is most often incurable. Invasive ductal carcinoma is a common form of breast cancer which can metastasize. Alternative approaches to early detection are needed. Due to the difficulties in the current approaches to the diagnosis of breast cancer, there is a need in the art for improved methods. The present invention fulfills this need, and further provides other related advantages.

[0007] DNA is continually being modified by microenvironmental factors, thus creating vast numbers of modified structures (ref. 1,2). For example, the progression of primary breast cancer to the metastatic state was estimated to involve as many as several billion new DNA forms, many of which likely result from hydroxyl radical (.OH)-induced structural alterations (ref. 2). Progress has been made in analyzing low mass oligonucleotides (<1×10³ base pairs) (ref. 3). However, the complexity and high masses of the cellular DNAs (≈6×10⁶ base pairs) have hindered their structural elucidation. Consequently, an understanding of these DNAs had to be obtained primarily by using destructive techniques (chemical or enzymatic) that provide little information on intact structures potentially having important biological properties.

[0008] The development of an infrared microscope spectrometer (FIG. 14), coupled with advanced computer software, made it possible to obtain Fourier transform-infrared (FT-IR) spectra from micrograms of cellular DNA (e.g., from biopsy specimens).

SUMMARY OF THE INVENTION

[0009] Briefly stated, the present invention provides methods for defining the state of tissue, and assessing the genotoxicity of an environment. The inventive methods are particularly well suited for differentiating a T-1 (primary, non-metastatic) tumor from a metastatic tumor. The invention is applicable to a wide variety of DNA samples and cancers, and to a wide variety of genotoxic environments.

[0010] In one aspect, the present invention employs the so-called “centroid” model (which may also be called the “sigmoid curve model”) with which tissue samples are analyzed. According to the centroid model, there is provided a method of screening for a tumor or tumor progression to the metastatic state comprising the steps of: (a) subjecting a DNA sample to Fourier transform-infrared (FT-IR) spectroscopy to produce FT-IR spectral data; (b) analyzing the FT-IR spectral data of step (a) by principal components analysis (PCA); and (c) comparing the PCA of step (b) to the PCA of FT-IR spectra for DNA samples from non-cancerous, non-metastatic tumor or metastatic tumor samples.
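By way of illustration only, the comparison of step (c) may be sketched as a calculation on group centroids in PC space. The sketch below assumes that PC scores are already available for each reference group; the numerical values, the group labels and the helper `nearest_centroid` are hypothetical, and the nearest-centroid rule is merely one simple way to perform the comparison, not necessarily the one used in the examples described herein.

```python
import numpy as np

# Hypothetical (PC1, PC2) scores for three reference groups (illustrative values only).
pc_scores = {
    "non_cancerous": np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]),
    "non_metastatic": np.array([[4.0, 1.0], [6.0, 1.0], [5.0, 4.0]]),
    "metastatic": np.array([[5.0, 6.0], [7.0, 6.0], [6.0, 9.0]]),
}

# The centroid of each group is the mean of its PC scores.
centroids = {group: scores.mean(axis=0) for group, scores in pc_scores.items()}

# Direction vectors between successive centroids.
normal_to_primary = centroids["non_metastatic"] - centroids["non_cancerous"]
primary_to_metastatic = centroids["metastatic"] - centroids["non_metastatic"]


def nearest_centroid(sample, centroids):
    """Return the group whose centroid is closest (Euclidean) to the sample.

    Assumption: this decision rule is an illustrative stand-in for step (c).
    """
    sample = np.asarray(sample, dtype=float)
    return min(centroids, key=lambda g: np.linalg.norm(sample - centroids[g]))


label = nearest_centroid([5.5, 2.5], centroids)
```

In this toy data set the sample at (5.5, 2.5) lies closest to the non-metastatic centroid at (5, 2).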

[0011] In another aspect, the present invention provides a so-called “ellipsoid model” for characterizing the state of a tissue. In this aspect, the invention provides a mathematical description corresponding to various defined states of a tissue of interest, i.e., a model. Defined states of a tissue include, e.g., normal prostate tissue, benign prostatic hyperplasia and metastatic prostate cancer, where “normal”, “benign hyperplasia” and “metastatic” are three “defined states”, and prostate tissue is the “tissue of interest”.

[0012] In brief, according to the ellipsoid model, the invention provides a method for defining the state, e.g., the physiological state, of a tissue, comprising the steps of:

[0013] (a) subjecting DNA from a first plurality of tissue samples to Fourier transform-infrared (FT-IR) spectroscopy to produce FT-IR spectral data;

[0014] (b) analyzing the FT-IR spectral data of step (a) by principal components analysis (PCA) to provide principal component (PC) scores;

[0015] (c) applying cluster analysis to the PC scores of step (b) to distinguish outlier and non-outlier tissue samples; and

[0016] (d) generating an equation, called a first equation, that defines a multivariate version of a normal bell-shaped curve which best fits the PC values from the non-outlier tissue samples, where the first equation defines the state of the first plurality of tissue samples.

[0017] In another embodiment, the method further includes repeating steps (a) through (d) above with a second plurality of tissue samples, to provide a second equation, where the second equation defines the state of the second plurality of tissue samples. In another embodiment, the method further includes the step of applying multivariate discriminant analysis to the first and second equations, to provide first and second probability equations, respectively. In another embodiment, the method further includes the steps of: (e) subjecting a DNA sample from a tissue having a state of interest to FT-IR spectroscopy to produce FT-IR spectral data; (f) analyzing the FT-IR spectral data of step (e) by PCA to provide a set of PC scores; and (g) combining the PC scores of step (f) with each of the first and second probability equations to provide first and second probability scores, respectively.

[0018] In a preferred embodiment, the inventive method provides a means for defining (characterizing) DNA from tissues, and hence defining the tissue itself, where the method includes the steps of:

[0019] (a) subjecting a plurality (“m”) of DNA samples from a first of “n” defined states of a tissue of interest (e.g., samples of normal prostate tissue from “m” different individuals) each to Fourier transform-infrared (FT-IR) spectroscopy to produce FT-IR spectral data;

[0020] (b) independently analyzing the FT-IR spectral data from each sample of step (a) by principal components analysis (PCA) to provide a plurality (“o”) of principal component (PC) scores (i.e., PC1, PC2, PC3 . . . PCo scores) from each of the “m” FT-IR spectra, every sample being characterized by an identical number of PC scores as obtained by the identical treatment of the FT-IR spectral data, to provide “m” sets of PC scores, each set containing “o” values;

[0021] (c) applying cluster analysis to the set of PC scores from the “n” defined states of the tissue of interest (i.e., to all of the PC1 to PCo scores obtained from the FT-IR spectra of the “m” samples of DNA) as obtained from all of the samples, to identify outlier and non-outlier tissue samples;

[0022] (d) generating an equation defining a multivariate version of a normal bell-shaped curve which best fits the non-outlier PC1 . . . PCo values for all of the samples in the first defined state;

[0023] (e) repeating steps (c) and (d) for each of the sets of PC scores obtained from step (b), to define a set of “n” equations, each of the “n” equations defining a multivariate version of a normal bell-shaped curve corresponding to each of the “n” sets of PC scores; and

[0024] (f) applying multivariate discriminant analysis to the “n” equations defining multivariate versions of normal bell-shaped curves of step (e), to define a probability equation for each of the “n” defined states of the tissue of interest.

[0025] According to the procedure outlined above (steps (a) through (f)), a probability equation is generated corresponding to each defined state of interest for a particular tissue of interest, where in combination these “n” probability equations define a model.

[0026] A sample of tissue of interest having an unknown defined state is then analyzed by FT-IR, and the spectral data obtained thereby is subjected to principal components analysis to define “o” PC scores. These “o” PC scores are then “plugged into” each of the “n” probability equations corresponding to the various defined states within the model for the same tissue of interest, to provide a number (“n”) of probability scores corresponding to the number of defined states from which the model was constructed. A probability score is thus obtained for each of the defined states of the model. A higher probability score indicates a higher likelihood that the tissue of interest is properly characterized by the defined state corresponding to the probability equation. For example, if plugging the PC scores into the probability equation corresponding to normal tissue provides a probability score of “w”, and if plugging those same PC scores into the probability equation corresponding to metastatic cancer provides a probability score of “x”, and “x”<“w”, then the sample is more likely to be normal tissue than metastatic cancer.
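The fitting and scoring procedure described above may be illustrated by the following minimal sketch. It assumes that each “equation” is a multivariate normal density characterized by the sample mean and covariance of the non-outlier PC scores of one defined state, and that the probability scores are obtained by normalizing the densities across states with equal prior weights. The function names, the simulated training data and the equal-prior assumption are hypothetical and serve only to make the example self-contained.

```python
import numpy as np


def fit_state(pc_scores):
    """Fit one multivariate normal 'equation': mean vector and covariance matrix."""
    x = np.asarray(pc_scores, dtype=float)
    return x.mean(axis=0), np.cov(x, rowvar=False)


def log_density(x, mean, cov):
    """Log of the multivariate normal density at the PC-score vector x."""
    d = np.asarray(x, dtype=float) - mean
    k = mean.size
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis_sq = d @ np.linalg.solve(cov, d)
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + mahalanobis_sq)


def probability_scores(x, model):
    """One probability score per defined state (assumption: equal priors)."""
    logs = np.array([log_density(x, mean, cov) for mean, cov in model.values()])
    weights = np.exp(logs - logs.max())  # subtract the max for numerical stability
    return dict(zip(model.keys(), weights / weights.sum()))


# Simulated two-dimensional PC scores for two defined states (not real data).
rng = np.random.default_rng(0)
training = {
    "normal": rng.normal([0.0, 0.0], 1.0, size=(30, 2)),
    "metastatic": rng.normal([6.0, 6.0], 1.0, size=(30, 2)),
}
model = {state: fit_state(scores) for state, scores in training.items()}

# Score a sample of unknown state and pick the most likely defined state.
scores = probability_scores([0.2, -0.1], model)
best = max(scores, key=scores.get)
```

Here the unknown sample lies near the simulated “normal” cluster, so its probability score for “normal” (“w”) exceeds that for “metastatic” (“x”).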

[0027] Thus, the invention further provides a method comprising the steps of:

[0028] (1) performing steps (a) through (f) above, to provide a model comprising a number “n” of probability equations corresponding to a number “n” of defined states for a particular tissue of interest;

[0029] (2) performing steps (g) through (j), as follows:

[0030] (g) subjecting a DNA sample from a tissue of interest having an unknown defined state, to Fourier transform-infrared (FT-IR) spectroscopy to produce FT-IR spectral data;

[0031] (h) analyzing the FT-IR spectral data of step (g) by principal components analysis (PCA) to provide a plurality (“o”) of principal component (PC) scores (i.e., PC1, PC2, PC3 . . . PCo scores), to provide a set of “o” PC scores;

[0032] (i) “plugging in” the set of “o” PC scores of step (h) into each of the “n” probability equations which compose the model of step (f) to obtain a probability score corresponding to each of the “n” defined states; and

[0033] (j) comparing the “n” probability scores from step (i) to one another in order to determine the most likely defined state of which the tissue having an unknown defined state is a member.

[0034] In any of the above methods, the tissue may be breast, urogenital, liver, renal, pancreatic, lung, blood, brain or colorectal tissue. In one embodiment, the tissue is cancerous, for example, cancerous breast, prostate, ovarian or endometrial tissue.

[0035] In another embodiment, the invention provides a method for assessing the genotoxicity of an environment. The method includes the steps of:

[0036] (a) subjecting DNA from a plurality of first organisms in a first environment to Fourier transform-infrared (FT-IR) spectroscopy to produce FT-IR spectral data;

[0037] (b) analyzing the FT-IR spectral data of step (a) by principal components analysis (PCA) to provide principal component (PC) scores;

[0038] (c) applying cluster analysis to the PC scores of step (b) to distinguish outlier and non-outlier organisms; and

[0039] (d) generating an equation, called a first equation, that defines a multivariate version of a normal bell-shaped curve which best fits the PC values from the non-outlier organisms, where the first equation defines the first organisms in the first environment.

[0040] In one embodiment, the invention further includes repeating steps (a) through (d) above with DNA samples from second organisms taken from a second environment, to provide a second equation, where the second equation defines the state of the second organisms in the second environment. In another embodiment, the invention further includes applying multivariate discriminant analysis to the first and second equations, to provide first and second probability equations, respectively. In another embodiment, the invention provides a method that further includes the steps of: (e) subjecting a DNA sample of an organism of interest from an environment of interest to FT-IR spectroscopy to produce FT-IR spectral data; (f) analyzing the FT-IR spectral data of step (e) by PCA to provide a set of PC scores; and (g) combining the PC scores of step (f) with each of the first and second probability equations to provide first and second probability scores, respectively.

[0041] In optional embodiments, at least one of the first and second environments is a polluted environment. In another optional embodiment, the first and second organisms are non-identical, but the first and second environments are identical. In another optional embodiment, the first and second organisms are identical, but the first and second environments are non-identical.

[0042] Thus, in a preferred embodiment, the present invention provides a method for assessing the genotoxicity of an environment. The method is essentially as described above, i.e., uses the centroid or ellipsoid model; however, the DNA samples are from organisms taken from various environments. As one example, the environments may suffer from various degrees of pollution. In any event, according to the centroid model, the method comprises the steps of: (a) subjecting a DNA sample of a first organism in an environment to Fourier transform-infrared (FT-IR) spectroscopy to produce FT-IR spectral data; (b) analyzing the FT-IR spectral data of step (a) by principal components analysis (PCA); and (c) comparing the PCA of step (b) to the PCA of FT-IR spectra for DNA samples of: (1) the first organism prior to introduction in the environment of step (a), or (2) a second organism in a nonpolluted environment. The ellipsoid model may likewise be used in a method for assessing the genotoxicity of an environment.

[0043] These and other aspects of the present invention will become evident upon reference to the following detailed description and attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0044] FIG. 1 shows a two-dimensional PC plot derived by PCA/FT-IR spectral analysis showing distinct clustering of normal, benign prostatic hyperplasia (“BPH”) and prostate cancer points. Notably, both of the groups of prostate lesions occur to the right of the points for the DNA of normal prostate.

[0045] FIG. 2 shows a comparison of the mean spectrum of prostate cancer vs. normal tissue (FIG. 2A), BPH vs. normal tissue (FIG. 2B) and prostate cancer vs. BPH (FIG. 2C). The lower plot of each panel (A-C) shows the statistical significance of the difference in mean absorbance at each wavenumber, based on the unequal variance t-test. P-values are plotted on the log₁₀ scale.

[0046] FIG. 3 shows sigmoid curves depicting the probability of DNA being classified as normal tissue versus prostate cancer (FIG. 3A), normal tissue versus BPH (FIG. 3B), and BPH versus prostate cancer (FIG. 3C). The curves are based on the logistic regression models depicted in Table 2 below. The predicted probabilities rise very rapidly over a narrow range, which reflects a high degree of discrimination among groups and a precipitous change in DNA structure associated with the normal to BPH and normal to prostate cancer progressions. Each sample is plotted at its predicted probability.

[0047] FIG. 4 is a three-dimensional plot of PC 1, 2 and 3 wherein each sphere represents a DNA absorbance spectrum and the location of a sphere is determined by the “shape” of the spectrum, including height, width and location of absorbance peaks. The core cluster of non-invasive ductal carcinoma of the breast (“IDC”) spheres in the upper part of the plot (medium stipple) is significantly smaller than the more diverse and larger IDC_(m) cluster (heavy stipple), and the reduction mammoplasty tissue (“RMT”) and metastatic invasive ductal carcinoma (“IDC_(m)”) clusters substantially overlap and are not statistically different in size;

[0048] FIG. 5A shows two spatially close IDC spectra (see arrows indicating A and B on the three-dimensional PCA plot) wherein the two overlaid spectra shown in FIG. 5B differ by a mean of only 3% in normalized absorbance, demonstrating the high specificity of the PCA and the fact that spatially close spheres have almost identical spectral profiles;

[0049] FIGS. 6A and 6B show the spectral profiles of two IDC outliers (identified in FIG. 5) compared to the spectral profile of the mean IDC core cluster; “1” represents a multifocal carcinoma, with one focus being a highly malignant signet ring cell carcinoma, and “2” represents a bilateral breast cancer. In each case, the dramatic difference between the mean and outlier spectrum is apparent over most of the spectral region (see text for wavenumber-structural relationships) illustrating the pronounced structural specificity associated with the PCs analysis;

[0050] FIG. 7 shows a centroid calculation of the spectra for the RMT, IDC, and IDC_(m) specimens on a graph plotting PC2 vs. PC1, and the direction vectors from the RMT centroid to the IDC centroid, and the IDC centroid to the IDC_(m) centroid;

[0051] FIG. 8 shows a centroid spectra overlay for the average RMT, IDC, and IDC_(m) species;

[0052] FIG. 9 shows a centroid spectra overlay for the average RMT, IDC, and IDC_(m) species after subtracting the mean, thus emphasizing the spectral differences between the species;

[0053] FIG. 10 shows the predicted probabilities of cancer based on FT-IR methodology;

[0054] FIG. 11 shows a three-dimensional projection of the clusters of points derived from the first three PC scores, which summarize spectral features of the DNA from English sole inhabiting an essentially clean control environment (QMH group) or inhabiting a chemically contaminated urban environment (DUW group);

[0055] FIGS. 12A-12C show a comparison of the mean spectrum for each of a QMH group and a DUW group. The lower plot of each panel shows the statistical significance of the difference in mean absorbance at each wavenumber, based on the unequal variance t-test. P-values are plotted on the log₁₀ scale;

[0056] FIG. 13 shows overlays of the individual spectra of QMH and DUW groups;

[0057] FIG. 14 provides a picture and schematic diagram of an FT-IR microscope spectrometer. FIG. 14A shows two overlaid grand mean spectra, while FIG. 14B provides P-values obtained for each wavenumber using the unequal variance t-test.

[0058] FIG. 15A shows a three-dimensional PC plot of a breast cancer (IDC) cluster including two specimens with very similar PC scores designated “a” and “b”. There are also two outliers: “c” represents the DNA of an IDC tissue from a patient with bilateral breast cancer and “d” the DNA of a multifocal carcinoma, one focus being a highly malignant signet ring cell carcinoma;

[0059] FIG. 15B shows that the spectra “a” and “b” differ by only 3% of mean normalized absorbance. Although the two spectra are virtually identical, their corresponding PC points are spatially distinct, thus demonstrating the high spectral specificity achieved with PCA;

[0060] FIG. 15C provides the spectrum of outlier “c” (from FIG. 15A) compared with the mean spectrum of the IDC core cluster (without the outliers);

[0061] FIG. 15D shows the spectrum of outlier “d” (from FIG. 15A) compared with the mean spectrum of the IDC core cluster (without the outliers). The dramatic differences between the mean and outlier spectra are apparent over most of the spectral region, resulting in the two corresponding PC points being far away from the main cluster.

[0062] FIG. 16A is a three-dimensional plot of PC scores of DNA from normal breast (n=21) and breast cancer (IDC; n=37) tissues showing distinct clustering of each group, together with the two outliers (c and d) shown in FIG. 15A;

[0063] FIG. 16B is a plot of the probability of cancer vs. the risk score for the normal breast and breast cancer. The cancer samples are mainly located at the upper portion of the sigmoid curve where the probability of cancer is >61.5%, whereas the normal breast samples are situated primarily in the lower portion. The null hypothesis that the PC scores do not discriminate between the groups is rejected with P<0.0001;

[0064] FIG. 16C is a two-dimensional plot of PC scores of DNAs from normal prostate (n=5), BPH (n=18) and prostate cancer (adenocarcinoma; n=8) in which the clustering is distinct (4);

[0065] FIG. 16D is a plot of the probability of cancer vs. the risk score for normal prostate and prostate cancer. The null hypothesis that the PC scores do not discriminate between the groups is rejected with P=0.04. The cancer outlier on the right side of the plot in FIG. 16C is in the same direction as the progressions from normal to cancer in the probability curve. This suggests that the DNA represented by this outlier has a high degree of structural modification.

[0066] FIG. 17 is a three-dimensional representation of DNA spectrum for IDC and IDC_(M) (in analogy with FIG. 16A, which provides a similar three-dimensional representation for normal breast tissue and breast cancer).

[0067] FIG. 18 is a plot obtained from a two-component ellipsoid model for discriminating metastatic breast cancer (IDC_(M)) and reduction mammoplasty tissue (RMT);

[0068] FIG. 19 is a plot obtained from a two-component ellipsoid model for discriminating primary breast cancer (IDC) and metastatic breast cancer (IDC_(M));

[0069] FIG. 20 is a plot obtained from a three-component ellipsoid model for discriminating IDC, IDC_(M) and RMT tissues;

[0070] FIG. 21 is a plot obtained from a three-component ellipsoid model for discriminating between normal (RMT), primary (IDC) and metastatic (IDC_(M)) breast cancer;

[0071] FIG. 22 shows plots of 100 simulated normal, IDC and IDC_(M) cases based on the multivariate normal model (i.e., the ellipsoid model).

DETAILED DESCRIPTION OF THE INVENTION

[0072] As noted above, the present invention is directed, in one aspect, toward methods of screening for a tumor or tumor progression to the metastatic state. The methods are based on the analysis of DNA. Because DNA is ubiquitous in all organisms, the methods of the invention are not limited to use of a particular DNA sample. Thus, a wide variety of cancers may be screened. Representative examples of cancers include breast, urogenital, melanoma, liver, renal, pancreatic, lung, circulation system, nervous system or colorectal cancers. Urogenital cancers include prostate, cervical, ovarian, bladder or endometrial cancers. Circulation system cancers include lymphomas. Nervous system cancers include brain cancers.

[0073] As used herein, the term “screening for” includes detecting, monitoring, diagnosing or prognosticating (predicting). DNA is analyzed as described herein to screen for a tumor. As used herein, “a tumor” may be present for the first time, or reoccurring, or in the process of occurring or reoccurring. The last scenarios (i.e., process of) represent opportunities for assessing, and insight into, the risk of cancer prior to clinical manifestation. The present invention may be used to predict that cancer cells are likely to form, even though they have yet to appear based on currently available methodologies. DNA is also analyzed as described herein to screen for tumor progression to the metastatic state. Progression of the tumor to the metastatic state refers to the end point (i.e., the metastatic state) as well as any intermediate point on the way to the end point.

[0074] The term “screening” further includes differentiating a metastatic and non-metastatic tumor. The so-called ellipsoid model, as described herein, is particularly preferred for this aspect of screening. In fact, using the ellipsoid model, normal tissue was correctly identified 89% of the time (16 of 18 samples) while cancer tissue was correctly identified 97% of the time (31 of 32 samples). In addition, using the ellipsoid model, primary (IDC) cancer was correctly identified 100% of the time (10 of 10 samples) while metastatic (IDC_(M)) cancer was correctly identified 82% of the time.

[0075] A “DNA sample” is DNA in, or from, any source. DNA may be removed from a variety of sources, including a tissue source or a fluid source. Tissue sources include tissue from an organ or membrane or skin. Fluid sources include whole blood, serum, plasma, urine, synovial fluid, saliva, sputum, cerebrospinal fluid, or fractions thereof. With respect to a tissue sample, for example, tissue may be removed from an organism by biopsy (such as a fine needle biopsy) and the DNA extracted, all by techniques well known to those in the art. Similarly, DNA may be extracted from a fluid source using known techniques. Although extraction/isolation of DNA may be preferred, DNA need not be extracted/isolated in order to carry out the invention. It is possible to examine DNA directly using Fourier transform-infrared (FT-IR) spectroscopy. For example, by specifically limiting the IR scan to cellular nuclei, spectral profiles of high concentration may be generated. Therefore, a DNA sample may be extracted/isolated DNA or a sample may include DNA.

[0076] It is possible to store tissue for later analysis of the DNA. For example, excised tissue may be frozen immediately in liquid nitrogen and maintained at −80° C. Following isolation of the DNA from such tissue, it is normally dissolved in deionized water and aliquoted into portions for FT-IR spectroscopy. Aliquots are typically dried completely by lyophilization, purged with pure nitrogen and stored in an evacuated, sealed glass vial.

[0077] Within the present invention a DNA sample is subjected to FT-IR spectroscopy and the FT-IR spectral data analyzed by principal components analysis. The starting point for the characterization of DNA in a sample is a set of IR spectra. Each spectrum shows numerical absorbances at each integer wavenumber, i.e., generally from 4000-700 cm⁻¹ and typically from 2000-700 cm⁻¹. Infrared (IR) spectra of DNA samples are obtained with a Fourier Transform-IR spectrometer, for example a Perkin-Elmer System 2000 (The Perkin-Elmer Corp., Norwalk, Conn.) equipped with an IR microscope and a wide-range mercury-cadmium-telluride detector. The DNA is generally placed on a barium fluoride plate in an atmosphere with a relative humidity of less than about 60% and flattened to make a transparent film. Using the IR microscope in a visual-observation mode, a uniform and transparent portion of the sample is selected to avoid a scattering or wedge effect in obtaining transmission spectra. Each analysis is generally performed in triplicate on 3-5 μg of DNA and the spectra are computer averaged. Generally, two hundred fifty-six scans at a 4-cm⁻¹ resolution are performed for each analysis to obtain spectra in a frequency range of 4000-700 cm⁻¹. Typically 3-5 minutes elapse from when the glass vial is broken to when each IR spectrum is obtained. Typically, the DNA specimens vary in thickness, yielding a diverse set of absorbances or spectral intensities. None of the IR spectra show a 1703-cm⁻¹ band, which is indicative of specific base pairing. This fact indicates that the samples have acquired a disordered form, the D-configuration.

[0078] The IR spectra are obtained in transmission units and converted to absorbance units for data processing. For example, the Infrared Data Manager software package (The Perkin-Elmer Corp.) may be used to control the spectrometer and to obtain the IR spectra. Additionally, the GRAMS/2000 software package (Galactic Industries Corp., Salem, N.H.) may be used to perform postrun spectrographic data analysis. Each spectrum is converted to a spreadsheet format that includes a specific absorbance for every wavenumber from 4000 to 700 cm⁻¹.
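The conversion from transmission units to absorbance units may be illustrated as follows, using the standard relation A = −log₁₀(T), where T is the fractional transmittance. The sketch assumes the spectrum is exported as percent transmittance; the function name and the sample values are hypothetical.

```python
import math


def transmittance_to_absorbance(percent_transmittance):
    """Convert percent transmittance values to absorbance: A = -log10(%T / 100)."""
    return [-math.log10(t / 100.0) for t in percent_transmittance]


# Illustrative values: 100 %T, 10 %T and 1 %T.
spectrum_percent_t = [100.0, 10.0, 1.0]
absorbance = transmittance_to_absorbance(spectrum_percent_t)
```

For these illustrative inputs the absorbances are 0, 1 and 2, respectively.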

[0079] In processing the IR data, a baseline adjustment is generally used for all spectra to remove the effect of background absorbance. In order to do this, the mean absorbance across 11 wavenumbers, centered at the lowest point (e.g., for the range 2000-700 cm⁻¹) is subtracted from absorbances at all frequencies. In addition, the IR data is generally normalized. Because there is not a well-established reference peak in the frequency range of 2000-700 cm⁻¹ useful for normalization, generally normalization is achieved by converting all absorbances to a constant mean intensity in the range of interest. For example, the region of 1750-700 cm⁻¹ (a span of 1051 wavenumbers) has been typically chosen within the present invention as the primary region for analysis, because it includes widely varying absorbances. After the removal of a baseline, described above, absorbances at all wavenumbers in a spectrum are divided by the mean absorbance ranging from 1750 to 700 cm⁻¹ for that spectrum, resulting in a mean spectral intensity of 1.0 for every specimen. All further analyses are generally performed on these baselined, normalized spectra (although analysis without the mean removed is also possible).
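The baseline adjustment and normalization described above may be sketched as follows. The sketch assumes one absorbance value per integer wavenumber; the function name, the synthetic two-band spectrum and the handling of a window that would extend past the end of the data (it is simply truncated) are assumptions made for illustration.

```python
import numpy as np


def baseline_and_normalize(wavenumbers, absorbance,
                           base_range=(700, 2000), norm_range=(700, 1750),
                           half_window=5):
    """Subtract an 11-point baseline and scale to a mean intensity of 1.0."""
    wn = np.asarray(wavenumbers)
    a = np.asarray(absorbance, dtype=float)

    # Locate the lowest absorbance within the baseline search range.
    idx = np.flatnonzero((wn >= base_range[0]) & (wn <= base_range[1]))
    lowest = idx[np.argmin(a[idx])]

    # Mean over 11 points centered at the lowest point
    # (assumption: the window is truncated at the array edges).
    lo = max(lowest - half_window, 0)
    hi = min(lowest + half_window + 1, a.size)
    a = a - a[lo:hi].mean()

    # Divide by the mean absorbance over the normalization range.
    in_norm = (wn >= norm_range[0]) & (wn <= norm_range[1])
    return a / a[in_norm].mean()


# Synthetic spectrum: constant background plus two Gaussian bands (not real data).
wn = np.arange(4000, 699, -1)
raw = (0.05
       + 0.5 * np.exp(-((wn - 1650) / 30.0) ** 2)
       + 0.3 * np.exp(-((wn - 1090) / 40.0) ** 2))
normalized = baseline_and_normalize(wn, raw)
mask = (wn >= 700) & (wn <= 1750)
```

After processing, the mean of `normalized` over the 1051 wavenumbers from 1750 to 700 cm⁻¹ is 1.0.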

[0080] Within the present invention, factor analysis is used to study the variation among spectra and the relation of this variation to subgroups, such as cancer versus non-cancer. In particular, spectral data acquired by FT-IR spectroscopy are analyzed using a principal components analysis (PCA) statistical approach. PCA is a statistical procedure applied to a single set of variables with the purpose of revealing a few variables (principal component scores or PCs) that are independent of each other and that capture most of the information in the original long list of variables (e.g., Timm, N.H. in Multivariate Analysis, ed. Timm, N.H., 1975, Brooks/Cole, Monterey, Calif., pp. 528-570). PCA yields a few PCs that summarize the major features that vary across spectra. PCA may be based on over a million correlations between absorbance-wavenumber values over the entire infrared spectrum. Numerous variables comprising the complex spectral relationships are reduced to a few PC scores. Each PC score is the weighted sum of the wavenumber-by-wavenumber deviations of a spectrum from the grand mean spectrum. Each PC score appears as a point in two- and three-dimensional PC plots and represents a group of distinct and highly discriminating structural properties of DNA.

[0081] For example, five principal components (i.e., five dimensions) can be sufficient to describe 1051 dimensions of FT-IR spectra (with the grand mean of all spectra subtracted from each spectrum) and visual representation in two or three dimensions is adequate. PCA is available in many basic and advanced statistical programs, such as SAS and S-Plus.
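The statement that each PC score is a weighted sum of the deviations of a spectrum from the grand mean spectrum can be illustrated with a minimal sketch. Power iteration is used here only to keep the example free of external libraries; it is an assumption of this sketch and not the routine used by SAS or S-Plus. The four-point "spectra" are invented values.

```python
import math

def first_principal_component(spectra, iterations=200):
    """Return (weights, scores, explained_fraction) for the first PC."""
    n, p = len(spectra), len(spectra[0])
    # Grand mean spectrum and per-spectrum deviations from it.
    grand_mean = [sum(s[j] for s in spectra) / n for j in range(p)]
    deviations = [[s[j] - grand_mean[j] for j in range(p)] for s in spectra]
    # Sample covariance matrix between wavenumbers (p x p).
    cov = [[sum(d[j] * d[k] for d in deviations) / (n - 1)
            for k in range(p)] for j in range(p)]
    # Power iteration toward the leading eigenvector of the covariance.
    v = [float(j + 1) for j in range(p)]
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # PC score = weighted sum of the deviations from the grand mean.
    scores = [sum(d[j] * v[j] for j in range(p)) for d in deviations]
    eigenvalue = sum(v[j] * sum(cov[j][k] * v[k] for k in range(p))
                     for j in range(p))
    total_variance = sum(cov[j][j] for j in range(p))
    return v, scores, eigenvalue / total_variance

# Invented toy spectra (5 specimens x 4 wavenumbers).
spectra = [
    [1.0, 2.0, 1.0, 0.5],
    [1.1, 2.4, 1.0, 0.5],
    [0.9, 1.6, 1.0, 0.5],
    [1.2, 2.8, 1.1, 0.5],
    [0.8, 1.2, 0.9, 0.5],
]
weights, scores, explained = first_principal_component(spectra)
```

Here `explained` is the fraction of the total variance captured by the first PC; further components would be obtained in the same way after removing the first from the data.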

[0082] The entire analysis is generally carried out with core clusters from each of the three groups (DNA from non-cancerous samples, non-metastatic tumor samples, and metastatic tumor samples), although it is possible to use more or fewer than all three groups (e.g., two of three groups, or non-cancerous samples versus all tumor samples regardless of whether metastatic or not). Using cluster analysis, those members of a specified group that stand apart from others in the core group are identified. The isolated group members all stand apart from any others in their group at Euclidean distances generally representing at least a 12% difference in the mean normalized absorbance, a visibly notable difference when spectra are conventionally plotted. The core clusters can be considered to be the more commonly encountered DNA structural phenotypes, whereas the isolated group members (“outliers”) represent less frequent phenotypes not present in great enough numbers to study with the sample, yet overly influential in the analysis if included.

[0083] Using core cluster analysis, PC scores are thus characterized interms of “outliers” and “inliers”. The PC scores which are “inliers” maythen be manipulated according to either of the centroid or ellipsoidmodels. The centroid model is discussed first below, followed by adiscussion of the ellipsoid model.

[0084] The determination of whether DNA structural changes for the progression of non-cancerous (NC) to non-metastatic tumor (NMT) are the same as for the progression of non-metastatic tumor (NMT) to metastatic tumor (MT) is tested on the basis of centroids statistically derived from groups of points. The centroid is the vector of mean absorbances of the 1051 individual wavenumbers from 1750 to 700 cm⁻¹. If the two progressions are similar, then the centroids of the three groups line up in two- and three-dimensional space.

[0085] Formally, the hypothesis that cos(θ)=1.0 is tested, where θ is the angle between a vector x pointing from the NC to the NMT centroid and a vector y pointing from the NMT to the MT centroid. cos(θ) is defined by cos(θ)=(x·y)/(|x|·|y|), where x·y is the dot product of the two vectors. The vector x is indexed by wavenumbers and, at each wavenumber, contains the difference between the mean normalized absorbance of NMT spectra and the mean normalized absorbance of NC spectra. The vector y shows the corresponding difference for MT minus NMT spectra. An angle θ=0 [which is equivalent to cos(θ)=1.0] implies that the NMT→MT progression is a “virtual straight ahead” continuation of the NC→NMT progression, and that the centroids line up, whereas θ≠0 implies that the NMT→MT progression involves a different suite of spectral (structural) changes. The hypothesis that cos(θ)=1.0 is tested using the bootstrap method (Efron and Gong, Am. Stat. 37:36-48, 1983), which involves resampling with replacement from the NC, NMT, and MT core clusters and calculation of cos(θ) for each resampling.
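The centroid comparison and bootstrap resampling described above may be sketched as follows. The two-point "spectra", the number of resamplings, the fixed random seed and the use of a simple percentile interval are all assumptions of this sketch rather than details taken from the cited bootstrap reference.

```python
import math
import random

def centroid(spectra):
    """Mean spectrum of a group (one mean per wavenumber)."""
    n = len(spectra)
    return [sum(s[j] for s in spectra) / n for j in range(len(spectra[0]))]

def cos_theta(nc, nmt, mt):
    """Cosine of the angle between the NC->NMT and NMT->MT centroid vectors."""
    c_nc, c_nmt, c_mt = centroid(nc), centroid(nmt), centroid(mt)
    x = [b - a for a, b in zip(c_nc, c_nmt)]   # NMT minus NC
    y = [b - a for a, b in zip(c_nmt, c_mt)]   # MT minus NMT
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def bootstrap_cos_theta(nc, nmt, mt, n_resamples=500, seed=0):
    """Resample each core cluster with replacement and recompute cos(theta)."""
    rng = random.Random(seed)
    return [cos_theta(rng.choices(nc, k=len(nc)),
                      rng.choices(nmt, k=len(nmt)),
                      rng.choices(mt, k=len(mt)))
            for _ in range(n_resamples)]

# Invented toy groups in which the two progressions are perpendicular.
nc = [[0.0, 0.0], [0.0, 0.5], [0.0, -0.5]]
nmt = [[1.0, 0.0], [1.0, 0.5], [1.0, -0.5]]
mt = [[1.0, 2.0], [1.5, 2.0], [0.5, 2.0]]

observed = cos_theta(nc, nmt, mt)
values = sorted(bootstrap_cos_theta(nc, nmt, mt))
# Approximate 95% percentile interval for cos(theta).
lower = values[int(0.025 * len(values))]
upper = values[int(0.975 * len(values)) - 1]
```

If the interval from `lower` to `upper` excludes 1.0, as it does for these toy groups, the data do not support the hypothesis that the centroids line up.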

[0086] To determine if the populations from which the NC and NMT core clusters are drawn have distinct centroids (i.e., distinct mean absorbance spectra), a permutation test is carried out on the distance between the NC and NMT centroids, randomly permuting labels among NC and NMT samples and recalculating distances between centroids. A similar permutation test is carried out for the distance between the NMT and MT centroids. Finally, the sizes of the three core clusters are compared using the Kruskal-Wallis ANOVA and Mann-Whitney (MW) tests on the distance of each spectrum to the centroid of its cluster. (The P values from the Kruskal-Wallis and MW tests are approximate, due to some statistical dependence introduced when sample values are compared with their sample mean.)
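A minimal sketch of such a permutation test on the distance between two centroids is given below. The group values, the number of permutations, the fixed seed and the "add one" form of the P-value estimate are assumptions chosen for illustration.

```python
import math
import random

def centroid_distance(group_a, group_b):
    """Euclidean distance between the centroids of two groups of spectra."""
    p = len(group_a[0])
    ca = [sum(s[j] for s in group_a) / len(group_a) for j in range(p)]
    cb = [sum(s[j] for s in group_b) / len(group_b) for j in range(p)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ca, cb)))

def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Return the observed distance and an approximate permutation P-value."""
    rng = random.Random(seed)
    observed = centroid_distance(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    count = 0
    for _ in range(n_permutations):
        # Randomly permute the group labels among the pooled samples.
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        d = centroid_distance(shuffled[:len(group_a)], shuffled[len(group_a):])
        if d >= observed:
            count += 1
    # The +1 terms count the observed labeling as one of the permutations.
    return observed, (count + 1) / (n_permutations + 1)

# Invented, clearly separated toy groups standing in for NC and NMT spectra.
nc = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.2], [0.2, 0.1]]
nmt = [[5.0, 5.1], [5.2, 5.0], [5.1, 5.3], [5.3, 5.2], [5.0, 5.2], [5.2, 5.1]]
distance, p_value = permutation_test(nc, nmt)
```

A small `p_value` indicates that a centroid distance as large as the observed one rarely arises when the labels are assigned at random.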

[0087] Wavenumber-absorbance relationships of infrared spectra of DNA analyzed by principal components analysis (i.e., PCA of FT-IR spectral data) may be expressed as points in space. Each point represents a highly discriminating measure of DNA structure. These PC scores can be plotted in 2- and 3-dimensional plots. The position of a spectrum in a plot is a description of how it differs from or is similar to other spectra in the plot. Different plot symbols or clusters for different groups of spectra help to highlight clustering of spectra. In addition, when two groups of spectra are analyzed, logistic regression can be used to develop a model for classifying the spectra based on their PC scores. Logistic regression is a method commonly used for classification and is available in many statistical software packages (such as SAS and S-Plus). The PC scores are predictors and the result is an equation (a model) which can be used to classify specimens. Each specimen is tagged with a numerical probability of being in the cancer group (for example) versus the non-cancer group. The results of this analysis can be plotted as a sigmoid curve with the cancer risk score (the logit of the estimated probability) on the X-axis and the estimated probability on the Y-axis. Using the prediction equation, the probability for a new specimen can also be calculated. By choosing a cut point (such as a probability of 0.5 or greater) all specimens can be classified as cancer or non-cancer (for example). The sensitivity and specificity of the classification can also be calculated using standard methods.
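The relationship between the risk score, the estimated probability, the cut point, and the sensitivity and specificity may be illustrated as follows. The intercept, the coefficients and the PC scores below are hypothetical values chosen for the sketch; in practice the coefficients would be estimated by fitting a logistic regression in a statistical package.

```python
import math

def risk_score(pc_scores, intercept, coefficients):
    """Linear predictor (the logit of the estimated probability)."""
    return intercept + sum(c * s for c, s in zip(coefficients, pc_scores))

def probability(risk):
    """Sigmoid curve mapping a risk score to an estimated probability."""
    return 1.0 / (1.0 + math.exp(-risk))

def sensitivity_specificity(probabilities, labels, cut_point=0.5):
    """labels: 1 for cancer, 0 for non-cancer."""
    pairs = list(zip(probabilities, labels))
    true_pos = sum(1 for p, y in pairs if y == 1 and p >= cut_point)
    true_neg = sum(1 for p, y in pairs if y == 0 and p < cut_point)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = sum(1 for _, y in pairs if y == 0)
    return true_pos / n_pos, true_neg / n_neg

# Hypothetical model and specimens (two PC scores per specimen).
intercept, coefficients = 0.0, (2.0, -1.0)
specimens = [(1.0, 0.0), (0.5, 0.5), (-0.5, 0.0),   # cancer
             (-1.0, 0.0), (0.0, 1.0), (0.5, 0.0)]   # non-cancer
labels = [1, 1, 1, 0, 0, 0]

probs = [probability(risk_score(s, intercept, coefficients)) for s in specimens]
sens, spec = sensitivity_specificity(probs, labels, cut_point=0.5)
```

Moving `cut_point` trades sensitivity against specificity, which is why the cut point may be chosen to suit a particular disease or population.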

[0088] Combination of FT-IR Spectroscopy with statistics

[0089] FT-IR spectra are sensitive representations of DNA structure(refs. 2, 4-6). Subtle changes, such as in redox status induced by freeradicals (refs. 1, 5, 6), will likely affect vibrational and rotationalmotion, thus altering wavenumber-absorbance relationships. Structuraldifferences between two groups of DNAs can be identified using t-testson the grand mean spectra, such as shown in FIG. 14A. The resultantP-values are given in FIG. 14B (ref. 4). The t-tests provide a P-valuefor the difference in mean absorbance at each wavenumber. In contrast,PCA is based on over a million correlations betweenabsorbance-wavenumber values over the entire spectrum (ref 2). Thenumerous variables comprising the complex spectral relationships aretaken into account and reduced to a few PC scores that are independentof each other. Each PC score is a weighted sum of thewavenumber-by-wavenumber deviations of a spectrum from the grand meanspectrum. In essence, the PC score represents a group of distinctspectral (hence, structural) properties of DNA.

[0090] Usually, the first two or three PC scores comprise 80% of thetotal variance. Three- (FIGS. 15A, 16A) or two- (FIG. 16C) dimensionalplots can be constructed based on these scores, each spectrum beingrepresented by a single point whose spatial orientation is a highlydiscriminating measure of DNA structure. Virtually identical spectra(FIG. 15B) can be separated as points in a PC plot (FIG. 15A, a and b).Moreover, two outlier points (FIG. 15A, c and d) representing spectrathat are markedly different from the mean spectrum (FIG. 15C, D) arelocated well away from the main cluster.

[0091] Logistic regression or discriminant analysis estimates aspecimen's “cancer probability” between 0.0 (non-cancer) and 1.0(cancer), based on its PC scores. Predicted cancer probabilities,derived from a model using the PC scores, are plotted vs. calculatedrisk scores (FIG. 16B, D). Probability values between those of normaland transformed tissues represent various degrees of cancer risk (refs.2,4-6). The probability-risk relationships constitute a promising basisfor screening and prognostic trials.

[0092] Applications of the FT-IR/Statistics Technology

[0093] In studies of breast cancer (refs. 2,5,6), major spectraldifferences were found for the progression normal breast→breast cancer(invasive ductal carcinoma; IDC). A three-dimensional PC plot revealed adistinct cluster of points representing the DNA of each group (FIG.16A). PC points for the IDC group were selected out and presented inFIG. 15A. Point c that represents the DNA of a patient with bilateralbreast cancer was completely separated from the main clusterrepresenting the DNA of patients with single breast tumors (ref. 2).Differences in the lesion status of a tissue were found to markedlyshift the PC point position. Point d that represents a specimencontaining a second focus of signet ring cell carcinoma, a highlymalignant lesion, is well separated from the main cluster. Theseexamples demonstrate that the FT-IR/statistics technology has apotentially high capability for elucidating DNA structural changes inrelation to a variety of biological conditions.

[0094] Normal and breast cancer PC scores, for a total of 54 samples, were analyzed using logistic regression and the resulting sigmoid curve of cancer probability vs. the risk score (FIG. 16B) showed a number of transitional values between non-cancer and cancer. In classifying the samples (including four additional distinct outliers) the predictive model had a sensitivity of 86% (percent of patients with cancer correctly classified) and a specificity of 81% (percent of patients without cancer correctly classified), using 61.5% probability as the cut-point. (The cut-point was chosen to jointly maximize sensitivity and specificity and may vary among diseases and populations.) The power of the model was substantiated by an independent test. Spectra of microscopically normal tissue (MNT) from near the breast tumors of 11 women (not included in the predictive model) were analyzed and the corresponding PC scores were calculated. When the scores were used in the model, ten of eleven (91%) had a predicted cancer probability >75%. Thus, on the basis of their DNA structures the MNTs were classified as “high risk.” This is supported by data showing that tissue near a breast tumor has a high risk for developing a second lesion (ref. 6).

[0095] Comparisons of grand mean spectra for the progression primarybreast cancer→metastatic breast cancer showed that the structure of DNAwas markedly altered (ref. 2), as suggested by pronounced differences inspectral areas assigned to the nucleotide bases and deoxyribose. Thesechanges, attributed primarily to an increase in reactions of the .OHwith DNA, resulted in a substantial increase in structural diversitythat was calculated on the basis of PC scores as previously described(ref. 2). The determination of diversity provides a useful measure ofstructural damage to DNA, such as induced by free radicals.

[0096] A comparison of grand mean spectra in the progressions normalprostate→prostate cancer (FIG. 14A) and normal prostate→benign prostatichyperplasia (BPH) revealed for the first time that the transformationsinvolve significant structural alterations in DNA (ref. 4). The firsttwo PC scores (76% of the total variance) were used for atwo-dimensional plot (FIG. 16C). The groups showed distinct clustering.The prostate lesion clusters were located to the right of those of thenormal prostate, and the BPH cluster was located to the right of thecancer cluster. The spatial arrangement suggests that the hypotheticalprogression BPH→prostate cancer (ref 7) is unlikely because it wouldrequire a structural reversion compared to the normal→BPH transformation(ref 4). This implies that each type of lesion is biologically derivedindependently, or that there are additional alterations in the DNA ofBPH that mimic a reversal in the progression to cancer.

[0097] The probability of prostate cancer, obtained via discriminantanalysis, was plotted vs. the risk score (the logit of the probability)and revealed near separation of the groups (FIG. 16D). The discriminantmodel (calculated using a total of 12 cancer and non-cancer samples)represented the clusters as multivariate normal distributions. Inclassifying the samples (including one additional cancer outlier) thepredictive model had a sensitivity of 88% and a specificity of 80%,using 50% probability as the cut-point. The technology affords apromising opportunity for additional studies of prostate cancer, toinclude the putative etiological relationship between prostaticintraepithelial neoplasia (PIN) and adenocarcinoma and the associationof prostate specific antigen (PSA) test results with cancer probabilityvalues (ref 7).

[0098] According to the ellipsoid model (which may also be referred to as the “multivariate normal model” or “MNM”), the PC scores capture patterns of variation in FT-IR spectra, where each PC score is a weighted sum of absorbances by wavenumber, as stated above. Each PC score emphasizes particular spectral regions, and a set of PC scores (about 6 scores are usually sufficient, although fewer scores may also be satisfactory) represents each spectrum very well. The PC scores will vary across spectra, and will emphasize differences between spectra. Generally, 6 PC scores are sufficient to capture at least about 90% of the total variation between the spectra.

[0099] The set of PC scores for a cluster (e.g., IDC_(M)) can beapproximated by a statistical model. Each PC score, e.g., PC1, can beapproximated by a “bell-shaped curve”, i.e., a Gaussian distribution.Thus, (when there are six PC scores) each of PC1, PC2, . . . , PC6 canbe approximated by a bell shaped curve separately. When several statesare analyzed together, PC1, PC2, etc. are usually correlated within agiven state (e.g., IDC_(M)). The full model is the multivariate normaldistribution, which is a mathematical equation.

[0100] The model may be viewed as infinitely many combinations of PC1,PC2, . . . PC6, etc. but some combinations are more probable thanothers. It is possible to draw a random sample from the model, and it isnot necessary to have the original data to do this (the model issufficient). If the sample is plotted (e.g., PC2 vs. PC1), the plot willshow great density where the mathematical model indicates that spectraare more likely to occur.

[0101] The model also allows construction of ellipsoids that capture ≥90% (or any desired percentage) of the infinite possibilities from the model. Mathematically, numerical methods are used to integrate the model function, where integrating inside the 90% ellipsoid yields 90% of the value obtained by integrating over −∞ to +∞. The ellipsoid will contain 90% of the probability. A randomly selected IDC_(M) spectrum, for example, has a 90% probability of falling inside the ellipsoid generated from IDC_(M) data. The length, width and height of a 3-dimensional ellipsoid are proportional to the standard deviation of PC score 1, PC score 2, PC score 3, respectively, for that cluster (e.g., IDC_(M)). The actual calculations are performed using the chi-squared distribution.
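The role of the chi-squared distribution can be illustrated in two dimensions, where the chi-squared quantile has a closed form. The restriction to two PC scores, the standard bivariate normal used for the simulation and all names below are assumptions of this sketch; for the three-dimensional ellipsoids described above the quantile for three degrees of freedom would instead be taken from a statistical table or library.

```python
import math
import random

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance for a 2-D point.

    cov is given as (sxx, sxy, syy), the entries of the covariance matrix.
    """
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def chi2_quantile_2df(coverage):
    """Chi-squared quantile for 2 degrees of freedom (closed form)."""
    return -2.0 * math.log(1.0 - coverage)

def inside_ellipse(point, mean, cov, coverage=0.90):
    """True if the point lies inside the ellipse holding `coverage` of the model."""
    return mahalanobis_sq(point, mean, cov) <= chi2_quantile_2df(coverage)

# Check by simulation: draw points from a standard bivariate normal model
# and count how many fall inside its 90% ellipse.
rng = random.Random(0)
mean, cov = (0.0, 0.0), (1.0, 0.0, 1.0)
draws = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20000)]
fraction_inside = sum(inside_ellipse(p, mean, cov) for p in draws) / len(draws)
```

The simulated `fraction_inside` should lie close to 0.90, which is the sense in which the ellipse "contains 90% of the probability".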

[0102] In summary, according to the ellipsoid model, the inventionprovides a method comprising the steps of:

[0103] (a) subjecting a plurality (“m”) of DNA samples from a first of“n” defined states of a tissue of interest (e.g., samples of normalprostate tissue from “m” different individuals) each to Fouriertransform-infrared (FT-IR) spectroscopy to produce FT-IR spectral data;

[0104] (b) independently analyzing the FT-IR spectral data from eachsample of step (a) by principal components analysis (PCA) to provide aplurality (“o”) of principal component (PC) scores (i.e., PC1, PC2, PC3. . . PCo scores) from each of the “m” FT-IR spectra, every sample beingcharacterized by an identical number of PC scores as obtained by theidentical treatment of the FT-IR spectral data, to provide “m” sets ofPC scores, each set containing “o” values;

[0105] (c) applying cluster analysis to the set of PC scores from the“n” defined states of the tissue of interest (i.e., to all of the PC1 toPCo scores obtained from the FT-IR spectra of the “m” samples of DNA) asobtained from all of the samples, to identify outlier and non-outliertissue samples;

[0106] (d) generating an equation defining a multivariate version of anormal bell-shaped curve which best fits the non-outlier PC1 . . . PCovalues for all of the samples in the first defined state;

[0107] (e) repeating steps (c) and (d) for each of the sets of PC scoresobtained from step (b), to define a set of “n” equations, each of the“n” equations defining a multivariate version of a normal bell-shapedcurve corresponding to each of the “n” sets of PC scores;

[0108] (f) applying multivariate discriminant analysis to the “n” equations defining multivariate versions of normal bell-shaped curves of step (e), to define a probability equation for each of the “n” defined states of the tissue of interest.

[0109] According to the procedure outlined above (steps (a) through(f)), a probability equation is generated corresponding to each definedstate of interest for a particular tissue of interest, where incombination these “n” probability equations define a model.

[0110] A sample of tissue of interest having an unknown defined state isthen analyzed by FT-IR, and the spectral data obtained thereby issubjected to principal components analysis to define “o” PC scores.These “o” PC scores are then “plugged into” each of the “n” probabilityequations corresponding to the various defined states within the modelfor the same tissue of interest, to provide a number (“n”) ofprobability scores corresponding to the number of defined states fromwhich the model was constructed. A probability score is thus obtainedfor each of the defined states of the model. A higher probability scoreindicates a higher likelihood that the tissue of interest is properlycharacterized by the defined state corresponding to the probabilityequation. For example, if plugging the PC scores into the probabilityequation corresponding to normal tissue provides a probability score of“w”, and if plugging those same PC scores into the probability equationcorresponding to metastatic cancer provides a probability score of “x”,and “x”<“w”, then the sample is more likely to be normal tissue thanmetastatic cancer.

[0111] Thus, the invention further provides a method comprising thesteps of

[0112] (1) performing step (a) through (f) above, to provide a modelcomprising a number “n” of probability equations corresponding to anumber “n” of defined states for a particular tissue of interest;

[0113] (2) performing steps (g) through (j), as follows:

[0114] (g) subjecting a DNA sample from a tissue of interest having anunknown defined state, to Fourier transform-infrared (FT-IR)spectroscopy to produce FT-IR spectral data;

[0115] (h) analyzing the FT-IR spectral data of step (g) by principalcomponents analysis (PCA) to provide a plurality (“o”) of principalcomponent (PC) scores (i.e., PC1, PC2, PC3 . . . PCo scores), to providea set of “o” PC scores,

[0116] (i) “plugging in” the set of “o” PC score of step (h) into eachof the “n” probability equations which compose the model of step (f) toobtain a probability score corresponding to each of the “n” definedstates; and

[0117] (j) comparing the “n” probability scores from step (i) to one another in order to determine the most likely defined state to which the tissue having an unknown defined state belongs.
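Steps (d) through (j) may be sketched, under simplifying assumptions, as fitting one bivariate normal density per defined state and assigning an unknown sample to the state with the highest resulting probability score. The use of only two PC scores, equal prior weights for the states, and the PC score values below are all hypothetical and are introduced only for this illustration.

```python
import math

def fit_state(points):
    """Estimate the mean and covariance (sxx, sxy, syy) of 2-D PC scores."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def density(point, mean, cov):
    """Bivariate normal density at `point`."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * d2) / (2.0 * math.pi * math.sqrt(det))

def classify(point, models):
    """Return the most likely state and the probability score of each state."""
    dens = {state: density(point, mean, cov)
            for state, (mean, cov) in models.items()}
    total = sum(dens.values())  # equal priors assumed for every state
    scores = {state: d / total for state, d in dens.items()}
    return max(scores, key=scores.get), scores

# Hypothetical non-outlier PC scores (PC1, PC2) for three defined states.
training = {
    "normal": [(-2.0, 0.1), (-1.8, -0.3), (-2.3, 0.4), (-1.9, 0.2), (-2.1, -0.2)],
    "IDC": [(0.0, 0.0), (0.3, 0.2), (-0.2, 0.3), (0.1, -0.3), (0.2, -0.1)],
    "IDC_M": [(1.5, 1.5), (2.5, 0.5), (1.0, 2.8), (3.0, 2.0), (2.0, -0.5)],
}
models = {state: fit_state(points) for state, points in training.items()}

state, scores = classify((0.1, 0.0), models)
```

Because every fitted density is nonzero everywhere, each state receives some probability score at any location; the sample is simply assigned to the state whose score is greatest.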

[0118] As seen in FIGS. 18, 19, 20 and 21, the ellipsoids overlap. In fact, the full models for these two or three clusters overlap everywhere. In other words, for any given location in the three-dimensional space, there is a probability that the spectrum for that point belongs to, e.g., RMT, another probability that it belongs to IDC, and another probability that it belongs to IDC_(M). However, each group (IDC, IDC_(M) and RMT) has greater density at some locations than others. A given sample is assigned to the group that has the greatest density at the location (PC scores) of the sample. Therefore, even where the 90% IDC ellipsoid is buried inside the 90% IDC_(M) ellipsoid, the IDC model is likely to have greater density at much or most of these interior points. Thus, a sample that provides PC data falling within this overlapping space is more likely to be an IDC.

[0119] In general, the ellipsoid model of the present invention allowsconstruction of a model to represent normal, IDC and IDC_(M)spectra/tissue. After obtaining PC scores as described above, thecorrelation and diversity of PC scores is determined. Selected data isthen fit to a statistical model with the same correlations anddiversities, based on a multivariate version of the bell-shaped curve.The model can be represented by ellipsoids containing an estimated 90%of the populations of each group.

[0120] The present invention allows for a prediction of thetransformation of breast tissue. According to the ellipsoid model, PCscores from a sample of breast tissue may be used to calculate threeprobabilities: probability that the tissue is normal, probability thatthe tissue is IDC, and probability that the tissue is IDC_(M). Thetissue is assigned to the group that gives it the highest probability.In fact, using the ellipsoid model, normal tissue was correctlyidentified 89% of the time (16 of 18 samples) while cancer tissue wascorrectly identified 97% of the time (31 of 32 samples). In addition,using the ellipsoid model, primary (IDC) cancer was correctly identified100% of the time (10 of 10 samples) while metastatic (IDC_(M)) cancerwas correctly identified 82% of the time. Thus, the ellipsoid model isparticularly well suited for correctly classifying and differentiatingprimary cancer tissue (correctly identified 97% of the time) andmetastatic cancer (correctly identified 82% of the time).

[0121] The present invention analyzes DNA samples by PCA of FT-IRspectral data and shows surprisingly that the direction of theprogression of non-cancerous (“normal”) DNA to non-metastatic tumor(“primary tumor”) DNA differs significantly from the direction of theprogression of primary tumor to metastatic tumor. By comparison of PCAof FT-IR spectra for a DNA sample of interest, to PCA of FT-IR spectrafor DNA samples from known non-cancerous, non-metastatic tumor andmetastatic tumor samples, one may determine whether the sample ofinterest is in one of these three states or progressing toward one ofthe tumor states.

[0122] For example, the present invention provides methods for thedetection of prostate cancer. The present invention applies technologyemploying principal components analysis (PCA) of Fourier-transforminfrared (FT-IR) spectroscopy (PCA/FT-IR technology) to DNA derived fromthe normal prostate, benign prostatic hyperplasia (BPH) andadenocarcinoma. As described in detail below, clusters of pointsrepresenting DNA from each of these tissues were almost completelyseparated in two-dimensional plots of principal components (PC) scores.This indicates that significant and specific structural modifications inDNA occur in the progression of normal tissue to BPH and normal tissueto prostate cancer, and that the modifications are unique for each ofthe two progressions. The structural alterations are reflected primarilyin spectral regions representing vibrations of the nucleic acids,phosphodiester and deoxyribose structures. The separation andclassification of the normal prostate versus BPH or adenocarcinoma isshown using logistic regression models of infrared spectra. Similarly,logistic regression models of DNA spectra are used herein to evaluatethe relationship between BPH and prostate cancer.

[0123] In the present characterization of DNA from prostate tissue,wavenumber-absorbance relationships of infrared spectra analyzed byprincipal components analysis (PCA) are expressed as points in space.Each point represents a highly discriminating measure of DNA structuralmodifications that altered vibrational and rotational motion offunctional groups of DNA, thus changing the spatial orientation of thepoints. Application of PCA/FT-IR technology to prostate tissue providesa virtually perfect separation of clusters of points representing DNAfrom normal prostate tissue, BPH and adenocarcinoma (prostate cancer).The progression of normal prostate tissue to BPH and to prostate cancerappears to involve structural alterations in DNA that are distinctlydifferent. Models based on logistic regression of infrared spectral dataare used to calculate the probability of a tissue being BPH oradenocarcinoma. Remarkably, the models have a sensitivity andspecificity of 100% for classifying normal versus cancer and normalversus BPH, and close to 100% for BPH versus cancer. Thus, the presentinvention shows that PCA/FT-IR technology is a powerful means fordiscriminating between normal prostate tissue, BPH and prostate cancer,with applicability for risk prediction and clinical application.

[0124] Although it is likely that the most popular use of the invention may be to assess the health of an individual organism with respect to cancer, it will be evident to those in a variety of arts that there are other uses. For example, the invention permits the analysis of environmental hazards. By analyzing DNA (as described herein) of an organism after exposure to an environment of unknown genotoxicity and comparing that profile to one obtained from DNA of the organism prior to its introduction to the environment (or comparing to an organism in a nonpolluted environment), an assessment of the genotoxicity of the environment can be made. In a preferred embodiment, the species of the organism in a nonpolluted environment is identical to that of the organism in the environment of unknown genotoxicity. As used herein, the term “nonpolluted environment” includes an environment without any chemical contamination or an environment lacking a specific pollutant or pollutants.

[0125] Importantly, the examples show that the use of theFT-IR/statistics technology has considerable promise for identifyingstructural alterations in DNA prior to the manifestation of transformedcells. These alterations can be used to establish disease probabilitymodels having potentially wide application in biology and medicine.

[0126] Other Applications

[0127] The FT-IR/statistics technology described herein focuses onbiological systems in which changes in DNA structure are known to play,or are suspected of playing, an important role in the development ofdisease. Notable examples to which the methods of the present inventionmay be directed include various forms of cancer (refs. 2, 4-6,8,9),Alzheimer's disease (ref 10), diabetes mellitus (ref. 11), heart disease(ref. 12) and Parkinson's disease and other neurodegenerative disorders(ref. 13). DNA changes are also potentially important in the putativerelationship between electromagnetic fields and cancer (ref. 14),infertility (ref. 15), radiation effects (ref. 16), aging (ref. 17),pharmacokinetic evaluations of drugs (ref. 18) and genetic alterationsin cultured cells (ref. 14). Moreover, studies linking oligonucleotideshaving different base arrangements to their corresponding spectralproperties, as revealed by statistical models, may be used to expand thescope of the technology in understanding genetic alterations.

REFERENCES

[0128] 1. Steenken, S., “Purine bases, nucleosides, and nucleotides:Aqueous solution redox chemistry and transformation reactions of theirradical cations and e⁻ and OH adducts,” Chem. Rev. 89:503-520, 1989.

[0129] 2. Malins et al., “Tumor progression to the metastatic stateinvolves structural modifications in DNA markedly different from thoseassociated with primary tumor formation,” Proc. Natl. Acad. Sci. USA93:14047-14052, 1996.

[0130] 3. Monforte, J. A. and C. H. Becker, “High-throughput DNAanalysis by time-of-flight mass spectrometry,” Nat. Med. 3:360-362,1997.

[0131] 4. Malins et al., “Models of DNA structure achieve almost perfectdiscrimination between normal prostate, benign prostatic hyperplasia(BPH), and adenocarcinoma and have a high potential for predicting BPHand prostate cancer,” Proc. Natl. Acad. Sci. USA 94:259-264, 1997.

[0132] 5. Malins et al., “Progression of human breast cancers to themetastatic state is linked to hydroxyl radical-induced DNA damage,”Proc. Natl. Acad. Sci. USA 93:2557-2563, 1996.

[0133] 6. Malins et al., “The etiology and prediction of breast cancer:Fourier transform-infrared spectroscopy reveals progressive alterationsin breast DNA leading to a cancer-like phenotype in a high proportion ofnormal women,” Cancer 75:503-517, 1995.

[0134] 7. Kirby et al., Prostate Cancer (Alfred Place, London, 1996).

[0135] 8. Camplejohn, R. S., “DNA damage and repair in melanoma andnon-melanoma skin cancer,” Cancer Surv. 26:193-206, 1996.

[0136] 9. Okamoto et al., “Analysis of DNA fragmentation in humanuterine cervix carcinoma HeLa S₃ cells treated with duocarmycins orother antitumor agents by pulse field gel electrophoresis,” Jpn. J.Cancer Res. 84:93-98, 1993.

[0137] 10. Mecocci et al., “Oxidative damage to mitochondrial DNA isincreased in Alzheimer's disease,” Ann. Neurol. 36:747-751, 1994.

[0138] 11. Dandona et al., “Oxidative damage to DNA in diabetesmellitus,” Lancet 347:444-445, 1996.

[0139] 12. Ferrari, R., “The role of mitochondria in ischemic heartdisease,” J. Cardiovasc. Pharmacol. 28(1):S1-S10, 1996.

[0140] 13. Jenner, P., “Oxidative stress in Parkinson's disease andother neurodegenerative disorders,” Pathol. Biol. (Paris) 44:57-64,1996.

[0141] 14. Dees et al., “Effects of 60-Hz fields, estradiol and xenoestrogens on human breast cancer cells,” Radiat. Res. 146:444-452, 1996.

[0142] 15. Sikka et al., “Role of oxidative stress and antioxidants inmale infertility,” J. Androl. 16:464-481, 1995.

[0143] 16. Algan et al., “Radiation inactivation of human prostatecancer cells: the role of apoptosis,” Radiat. Res. 146:267-275, 1996.

[0144] 17. Mandavilli, B. S. and K. S. Rao, “Accumulation of DNA damagein aging neurons occurs through a mechanism other than apoptosis,” J.Neurochem. 67:1559-1565, 1996.

[0145] 18. Wender et al., “Studies on DNA-cleaving agents: Computermodeling analysis of the mechanism of activation and cleavage ofdynemicin-oligonucleotide complexes,” Proc. Natl. Acad. Sci. USA88:8835-8839, 1991.

[0146] The following examples are offered by way of illustration and notby way of limitation.

EXAMPLES

[0147] In the Examples, the analysis of the data was according to the centroid (also called the “sigmoid”) model. However, the data acquisition and characterization in terms of PC scores and cluster analysis would be the same for the ellipsoid model. In the ellipsoid model, the “inlier” PC scores (as identified by cluster analysis) would be fitted to a multivariate normal distribution, which is essentially a multivariate generalization of the normal (Gaussian) bell-shaped curve, and then the various equations describing the bell-shaped curves as obtained from a certain tissue type would be subjected to discriminant analysis to provide probability equations. Commercially available statistical programs, e.g., SAS, can generate the appropriate models, and perform the necessary discriminant analysis, if the raw data (PC scores) are provided. As more data become available, the SAS program will generate more accurate probability equations. The SAS program will also be able to receive PC scores from a sample having an unknown defined state, and then “plug” these values into the probability equations to provide probability scores for the sample having a given defined state. Many statistics textbooks also provide descriptions of discriminant analysis and the construction of multivariate normal bell-shaped curves.

[0148] FIG. 14 provides a picture and schematic diagram of a FT-IR microscope spectrometer (System 2000, Perkin-Elmer Corp., Norwalk, Conn.) and its use for elucidating DNA structure. DNA (10-15 μg), extracted from a split tissue, is lyophilized. The dry, fluffy DNA is rolled out on a microscope slide forming a thin, transparent film that is peeled off with a scalpel and placed onto the BaF₂ window. The microscope is focused on the film when the visible beam is introduced in-path. Inserting the aperture, ten uniform areas of diameter >100 μm are chosen. The infrared beam is switched in-path and focused through each area, scanning between 2000 and 700 cm⁻¹ after a background scan on the BaF₂ window. The interferogram recorded in the detector is Fourier-transformed to an absorbance spectrum. Each spectrum is baselined (the mean absorbance across 11 wavenumbers, centered at the minimum absorbance between 2000 and 1700 cm⁻¹, is subtracted from the total absorbances) and then normalized (the entire baselined spectral absorbances are divided by the mean between 1750 and 700 cm⁻¹) to adjust for the sample's optical characteristics (e.g., related to film thickness). These procedures can be carried out with simple functions in the S-PLUS statistical package (Mathsoft Corp., Analysis Products Division, Seattle, Wash.). Ultimately, a grand mean is obtained for the DNA of one type of tissue (e.g., healthy prostate) which can be compared statistically to that of another type of tissue (e.g., prostate cancer) (4). (FIG. 14A) Two overlaid grand mean spectra. Absorbance values between 1700 and 1450 cm⁻¹ are assigned to C—O stretching and NH₂ bending vibrations, and 1450-1300 cm⁻¹ to NH vibrations and CH in-plane deformations of nucleotide base. The antisymmetric stretching vibrations of the PO₂⁻ structure occur at ≈1240 cm⁻¹ and vibrations of deoxyribose are generally assigned to absorbance values between 1150 and 950 cm⁻¹ (6). (FIG. 14B) P-values obtained for each wavenumber using the unequal variance t-test. P-values ≤0.05 (shown in the regions 1590-1510 cm⁻¹ and 1060-1010 cm⁻¹) are evidence for a spectral/structural difference between the DNA samples.
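The baselining and normalization steps described above can be sketched in a few lines. This is only an illustrative Python sketch, not the S-PLUS functions referred to in the text; the function name, the synthetic spectrum, and the rule for clipping the 11-point window at the ends of the scan are assumptions made for the example.

```python
import math

def baseline_and_normalize(wavenumbers, absorbances):
    """Baseline, then normalize, one spectrum sampled at integer wavenumbers."""
    # Index of the minimum absorbance between 2000 and 1700 cm^-1.
    window = [i for i, w in enumerate(wavenumbers) if 1700 <= w <= 2000]
    i_min = min(window, key=lambda i: absorbances[i])
    # Mean over 11 points centered on that minimum (clipped at the ends of
    # the scan; the clipping rule is an assumption of this sketch).
    lo, hi = max(0, i_min - 5), min(len(absorbances), i_min + 6)
    baseline = sum(absorbances[lo:hi]) / (hi - lo)
    baselined = [a - baseline for a in absorbances]
    # Divide by the mean baselined absorbance between 1750 and 700 cm^-1.
    region = [b for w, b in zip(wavenumbers, baselined) if 700 <= w <= 1750]
    scale = sum(region) / len(region)
    return [b / scale for b in baselined]

# Synthetic spectrum (made-up values): one band near 1240 cm^-1 on a sloped floor.
wavenumbers = list(range(2000, 699, -1))
raw = [0.1 + 0.5 * math.exp(-((w - 1240) / 100.0) ** 2) + 1e-4 * abs(w - 1850)
       for w in wavenumbers]
normalized = baseline_and_normalize(wavenumbers, raw)
# Mean normalized absorbance over the 1051 wavenumbers from 1750 to 700 cm^-1.
region_mean = sum(n for w, n in zip(wavenumbers, normalized)
                  if 700 <= w <= 1750) / 1051
```

After this step the mean normalized absorbance between 1750 and 700 cm⁻¹ is 1.0 by construction, which is what allows spectral differences to be read as percentages.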

Example 1 Prostate Cancer

[0149] A. Tissue Acquisition, DNA Isolation and PCA/FT-IR Spectral Analysis: After excision, each tissue was flash frozen in liquid nitrogen. All tissues were kept at −80° C. prior to use and DNA was maintained under an atmosphere of pure nitrogen during the extraction procedure to avoid oxidation. DNA was isolated from the tissues and aliquoted for FT-IR spectroscopy (about 20 μg). Each DNA sample was completely dried by lyophilization, purged with pure nitrogen, and stored in an evacuated, sealed glass vial at −80° C. A total of 31 tissue samples were used. Five samples of prostate tissue obtained from individuals who died by accidents were examined histologically and found to be normal. These served as controls. Eighteen samples of benign prostatic hyperplasia (BPH) and eight samples of adenocarcinoma (cancer) served as test samples, each comprising a portion of the histologically identified lesion. All samples were obtained from the Cooperative Human Tissue Network, Cleveland, Ohio, together with related pathology data.

[0150] The IR spectra were obtained using the Perkin-Elmer System 2000 equipped with an I-series microscope (The Perkin-Elmer Corp., Norwalk, Conn.). For PCA/FT-IR spectral analysis, each spectrum was normalized across the range of 1750 to 700 cm⁻¹, as described above. This yielded a relative absorbance value for each wavenumber, with a mean of 1.0. Euclidean distance was used to define the difference between a pair of spectra either for the entire spectrum or for a sub-region. This standard distance measure is defined as the square root of the sum of squared absorbance differences between spectra at each of the wavenumbers considered (e.g., 1051 for the entire spectral region 1750-700 cm⁻¹). The Euclidean distance can also be expressed in a more descriptive form as a percent. The numerator of the percent is the Euclidean distance divided by the square root of the number of wavenumbers for a region. The denominator used here for the percent for any region is the mean normalized absorbance between 1750-700 cm⁻¹, which is 1.0 for every case.
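As a worked illustration of the percent form of the Euclidean distance, the following Python sketch follows the definition given above. The function names and the two flat example spectra are hypothetical and chosen only so the result is easy to check by hand.

```python
import math

def euclidean_distance(spec_a, spec_b):
    # Square root of the sum of squared absorbance differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(spec_a, spec_b)))

def percent_difference(spec_a, spec_b, mean_normalized_absorbance=1.0):
    # Numerator: Euclidean distance divided by sqrt(number of wavenumbers).
    # Denominator: mean normalized absorbance (1.0 after normalization).
    n = len(spec_a)
    numerator = euclidean_distance(spec_a, spec_b) / math.sqrt(n)
    return 100.0 * numerator / mean_normalized_absorbance

# Two made-up spectra over 1051 wavenumbers that differ by 0.1 everywhere.
spec_a = [1.0] * 1051
spec_b = [1.1] * 1051
pct = percent_difference(spec_a, spec_b)  # approximately 10 percent
```

A uniform offset of 0.1 in normalized absorbance therefore corresponds to a distance of about 10%, which gives a sense of scale for values quoted in the Examples, such as the 52% and 41% outlier distances.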

[0151] Principal components (PC) analysis (PCA) was used to identify a few variables (components) that capture most of the information in the original, long list of variables (the spectral absorbances at each wavenumber). This reduction in the number of variables is analogous to the process in educational testing whereby many individual test scores, such as in reading and arithmetic, are combined into a single academic performance score. Four PC scores (i.e., four dimensions) were found to be sufficient to describe the 1051 dimensions of the normalized spectra. PC scores were calculated with the grand mean of all spectra subtracted from each spectrum. The nonparametric Spearman correlation coefficient was used to assess the association of PC scores with patient ages and Gleason scores. The nonparametric analysis was used because some of the distributions are skewed or are not normal (“bell-shaped”), which can lead to a bias in statistical significance when estimated from the Pearson correlation coefficient.

[0152] Two cases, which were outliers, were omitted from these analyses, leaving 29 cases. The omitted BPH sample and the omitted cancer sample had spectra very different from the included cases. Their Euclidean distances from the most similar spectra were 52% and 41%, respectively. All other spectra differed from their “nearest neighbor” spectrum by at most 21%, with a majority of spectra differing by less than 11%. The two outlier spectra show drastically reduced absorbance in the region around 1650 cm⁻¹, representing vibrations of the nucleic acids.

[0153] The Kruskal-Wallis and Mann-Whitney tests were used to determine if the three groups had similar diversity, defined as the mean distance of a spectrum to its group centroid. A permutation test was used to determine whether the three groups tended to cluster separately (representing an internal similarity of spectral properties in a group). The distance of each spectrum to its nearest neighbor in its own group (either normal, BPH, or cancer) was calculated, and the mean of these nearest neighbor distances for all of the spectra was the test statistic. The test was carried out by randomly permuting group membership labels 10³ times and recalculating the test statistic each time. A smaller observed distance to the nearest neighbor than that obtained by random relabeling of groups is an indication of clustering. A nonparametric, rank-based version of this test was carried out by expressing each distance as a rank. For each spectrum, the distances to other spectra were ranked and the permutation test was carried out as described above, but with distances replaced by ranks. The test statistic was a mean rank. Again, a smaller observed mean rank than the mean obtained from random permutation is an indication of clustering. Both the test using distance and the test using ranks were carried out for the entire spectrum, 1750-700 cm⁻¹, and for several subregions.
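The distance-based permutation test described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names, the toy one-dimensional "spectra", and the convention of reporting the P-value as the fraction of relabelings whose statistic is less than or equal to the observed one are not taken from the source. Every group is assumed to contain at least two spectra.

```python
import math
import random

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_nearest_neighbor_distance(spectra, labels):
    # Mean, over all spectra, of the distance to the nearest spectrum
    # carrying the same group label.
    total = 0.0
    for i, (s, g) in enumerate(zip(spectra, labels)):
        total += min(distance(s, t)
                     for j, (t, h) in enumerate(zip(spectra, labels))
                     if j != i and h == g)
    return total / len(spectra)

def permutation_test(spectra, labels, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = mean_nearest_neighbor_distance(spectra, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random relabeling of group membership
        if mean_nearest_neighbor_distance(spectra, shuffled) <= observed:
            hits += 1
    return observed, hits / n_perm

# Toy data (hypothetical): two well-separated groups of four "spectra".
spectra = [[0.0], [1.0], [2.5], [4.5], [100.0], [101.0], [102.5], [104.5]]
labels = ["normal"] * 4 + ["cancer"] * 4
observed, p_value = permutation_test(spectra, labels)
```

The rank-based version would replace each distance with its rank among the distances from that spectrum to all others before taking the mean; it is omitted here to keep the sketch short.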

[0154] Finally, logistic regression analysis was used as a model to determine if PC scores could be used to discriminate between pairs of DNA groups (normal versus BPH, normal versus cancer and BPH versus cancer). The logistic regression analysis yields a risk score, which is a linear combination of PC scores, and a predicted probability of a sample being in one of the two groups considered (e.g., the probability of being BPH when BPH is compared to normal). These predicted probabilities, along with a chosen probability cut point, can be used to classify samples and provide estimates of sensitivity and specificity, or percent of samples correctly classified. For each analysis a cut point was chosen that jointly maximized sensitivity and specificity.

[0155] B. Clustering in PC Plots: PCA/FT-IR spectral analysis yielded four components (four PC scores per case) which explained a total of 90% of the spectral variation over 1051 wavenumbers. That is, most of the features of the 29 spectra could be described by four PC scores (labeled PC1, PC2, PC3, PC4). The first two PC scores explained 76% of the variation and were adequate for two-dimensional representation (FIG. 1). FIG. 1 shows that the three groups were distinctly clustered. The two outliers omitted from the analysis are also represented on this plot and appear to the right of the main clusters.

[0156] The actual distance of the outlier points to other points is larger than that shown in this two-dimensional plot due to differences represented by other dimensions. The permutation test for clustering of groups (1750-700 cm⁻¹) yielded P=0.1, based on the distance measure, and P=0.01 using the nonparametric ranking technique (Table 1). The greater significance obtained by the ranking method arises from the relative isolation of one or two cases from the core of their group (FIG. 1), a configuration which influences the distance measure more than the ranking measure. Using these techniques, significant clustering was obtained for two regions of the spectrum: 1174-1000 cm⁻¹ (assigned to strong stretching vibrations of the PO₂⁻ and C—O groups of the phosphodiester-deoxyribose structure) and 1499-1310 cm⁻¹ (assigned to weak NH vibrations and CH in-plane deformations of the nucleic acids). The P-values for mean distance and mean rank for these regions ranged from 0.02 to <0.001 (Table 1). The significance levels obtained strongly reject the null hypothesis that the observed clustering of the three groups occurred by chance. Overall, the findings indicate that DNA is altered in ways that produce clustering and, consequently, discrimination between normal prostate, BPH and prostate cancer DNA (FIG. 1; Tables 1 and 2).

[0157] Detailed comparisons were made between the spectra of pairs of groups: normal vs. cancer, normal vs. BPH and BPH vs. cancer. The statistical significance of differences in mean normalized absorbance between groups was assessed for each wavenumber between 1750-700 cm⁻¹, using the unequal variance t-test (FIG. 2; A-C). The plot shows the comparison of the mean spectrum for each of the two groups, as well as the P-value from the t-test. The regions with P<0.05 represent differences between groups (e.g., normal vs. cancer) which are much less likely to be due to chance than regions with P>0.05. Each of the spectral comparisons between groups shows statistically significant differences in areas of the spectrum assigned to vibrations of the phosphodiester-deoxyribose structure and the nucleic acids. The spectral regions with significant differences in absorbance for the phosphodiester-deoxyribose structure are similar (≈1050-1000 cm⁻¹); however, absorbances associated with the nucleic acids vary among the groups. That is, for the normal-cancer comparison, the region of significant difference is primarily ≈1475-1400 cm⁻¹ (C═O stretching and NH bending vibrations), whereas for the normal-BPH comparison it is ≈1600-1500 cm⁻¹. The comparison for BPH-cancer is focused at ≈1500 cm⁻¹. For the normal-BPH and BPH-cancer comparisons, significant differences are shown between ≈1175 to 1120 cm⁻¹, a region that likely includes symmetric stretching vibrations of the PO₂ group. The difference in means at all of these spectral regions is apparent from the plots of mean spectra per group in FIG. 2. The structural modifications are pivotal in the spatial distribution of points in the PC plot (FIG. 1) and in the pronounced discrimination between clusters (Table 1).

TABLE 1
Mean distance to nearest neighbor of same group and permutation test for non-random clustering. Distance is expressed as a percent difference between spectra; 10³ permutations were performed for each spectral sub-region.

                             Mean distance¹                        Mean rank²
Spectral region                 random                                random
(cm⁻¹)            observed   permutation   P-value     observed   permutation   P-value
1750-700            12.2        12.8         0.1          2.0         3.0         0.01
1750-1500           12.3        12.3         0.5          2.4         3.0         0.09
1499-1310            5.9         6.5         0.02         1.6         3.0        <0.001
1309-1175            6.7         6.5         0.7          3.0         3.0         0.5
1174-1000           13.2        15.0         0.02         2.0         3.0         0.01
999-700              6.9         7.4         0.1          2.3         3.0         0.05

[0158] C. Cluster Diversity: The diversity of the three groups, expressed as the mean distance to the group centroid, did not differ significantly (P=0.8). However, the normal prostate group was slightly less diverse (mean distance=11.7%) than was the BPH group (mean distance=14.5%) or prostate cancer group (mean distance=13.9%). Increased structural diversity generated in primary tumors is likely an important factor in selecting DNA forms that potentially give rise to malignant cell populations.

[0159] D. Group Classification: PC scores can be readily used to classify patients into groups when pairs of groups are compared using logistic regression. The logistic regression model (Table 2) is an equation which yields a risk score, R, when the values of the PC scores are inserted into the equation. R is transformed to a probability by the following standard statistical equation: probability=exp(R)/[1+exp(R)]. A cut point is chosen and if the probability exceeds this cut point, the case would be classified as BPH (in the normal versus BPH comparison). The actual cut points are noted below. As shown in Table 2, the models for normal versus cancer and normal versus BPH correctly classify 100% of each group and 100% overall (P-values in each case were <0.001). The correct classification rate for cancer versus BPH was close to 90%, based on a designation of “cancer” for a predicted probability of ≥0.1. (Probability cut-points of 0.15 to 0.41 achieve the same correct classification rates in the BPH vs. cancer comparison.) The predicted probabilities based on the models in Table 2 are given in FIG. 3. The individual risk score is based on the appropriate PC model (Table 2) and the predicted probability is a mathematical function of the risk score, as noted above. All of the BPH and cancer cases have predicted probabilities extremely close to 1.0 and all of the normal cases have predicted probabilities of ≤0.002 when BPH or cancer are compared to normal cases. These marked distinctions in predicted probabilities confirm the clear separation of groups, as shown in FIG. 1. When cancer is compared to BPH, predicted cancer probabilities ranged from 0.42 to 1.00 and predicted BPH probabilities ranged from 0.00 to 0.65.
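The conversion of a risk score R into a probability and a classification can be illustrated with a short Python sketch. The coefficients are those listed in Table 2; the function names are hypothetical, and the PC score values used in the example are made up, since the scaling of the PC scores in the original analysis is not restated here.

```python
import math

# Coefficients from Table 2: intercept, then PC1..PC4 (terms not in a model are 0).
MODELS = {
    "normal vs. BPH": (24.9, 5.2, 5.8, 3.9, 0.0),
    "normal vs. cancer": (34.3, 12.0, 0.0, 0.0, -21.0),
    "BPH vs. cancer": (-14.5, -4.5, 0.0, 0.0, -11.1),
}

def risk_score(model, pc_scores):
    # R is a linear combination of the PC scores plus an intercept.
    intercept, *coefficients = MODELS[model]
    return intercept + sum(c * s for c, s in zip(coefficients, pc_scores))

def probability(r):
    # exp(R) / [1 + exp(R)], arranged to avoid overflow for large |R|.
    if r >= 0:
        return 1.0 / (1.0 + math.exp(-r))
    e = math.exp(r)
    return e / (1.0 + e)

def exceeds_cut_point(model, pc_scores, cut_point):
    # True means the case is assigned to the second-named group of the model.
    return probability(risk_score(model, pc_scores)) > cut_point

# Hypothetical PC scores (PC1..PC4) for one sample.
example_scores = (1.0, 1.0, 1.0, 0.0)
example_r = risk_score("normal vs. BPH", example_scores)
example_p = probability(example_r)
```

Because the transformation is steep, risk scores only a few units away from zero already give probabilities very close to 0 or 1, consistent with the sharp transitions described for FIG. 3.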

[0160] The two outliers omitted from the analyses tend to support the findings. The outlier BPH and cancer points lie to the right in the PC plot (FIG. 1). This is the same direction found with the progressions from normal to BPH and from normal to cancer, suggesting that the outlier DNAs have a higher degree of structural modification. When the models shown in Table 2 were used to classify the two outliers, the BPH outlier was correctly classified, using the normal versus BPH model, with a predicted BPH probability close to 1.0. The cancer outlier is also correctly classified in the normal versus cancer model with a predicted cancer probability close to 1.0. In the BPH versus cancer model, the BPH outlier is correctly classified with a predicted cancer probability close to zero; however, the cancer outlier is incorrectly classified as a BPH with a cancer probability close to zero.

TABLE 2
Logistic regression models for probability of BPH (vs. Normal), Cancer (vs. Normal) and Cancer (vs. BPH). Normal, n = 5. BPH, n = 17. P-values are based on the null hypothesis that each model is not predictive of group membership. P-values are calculated from a chi-square test on change in deviance.

                                 Coefficients ± Standard Errors
Model                Intercept      PC1            PC2           PC3           PC4
normal vs. BPH       24.9 ± 0.1     5.2 ± 0.2      5.8 ± 0.04    3.9 ± 0.03    —
normal vs. Cancer    34.3 ± 0.1     12.0 ± 0.04    —             —             −21.0 ± 0.1
BPH vs. Cancer       −14.5 ± 8.1    −4.5 ± 2.6     —             —             −11.1 ± 6.3

                              Correct Classification Rate
Model                By Group                       Overall    P-Value*
normal vs. BPH       normal: 100%; BPH: 100%        100%       <0.001
normal vs. Cancer    normal: 100%; Cancer: 100%     100%       <0.001
BPH vs. Cancer       BPH: 88%; Cancer: 100%         92%        <0.001

[0161] E. Age and Gleason Score Relationships: Age does not appear to be a factor in creating the pronounced distinctions among groups, although the incidence of prostate cancer increases significantly over the age of 50 years. The age ranges for the three groups were 16-73 years for normal (n=5); BPH, 58-73 (n=17); and cancer, 61-76 (n=7). Among the Spearman correlations of age with each of the four PC scores, none were statistically significant at the P<0.05 level. In all, 28 correlations were considered, consisting of age correlated with each PC score in each of the three groups, as well as in all pairs of groups (e.g., age correlated with each PC score in normal and BPH tissue combined) and in the entire pooled set of 29 cases. Spearman correlations ranged in magnitude from 0.01 to 0.59 with P=0.09 to P=1.0. The most significant correlation was r=−0.51 between age and PC4 in the combined normal and cancer groups (P=0.09). When PC4 was omitted from the logistic regression analysis and models were based on PC1-PC3, the P-values corresponding to those in Table 2 were, top to bottom, P<0.001, P<0.001 and P=0.005, again supporting a non-random distinction among the groups. These results based on PC4 and the weak or nonsignificant correlations between age and other PC scores do not support any role for age in the ability to use spectra to distinguish among the groups.

[0162] The Gleason score, which uses microscopically evinced architectural changes to classify tumor status, had little association with the PC scores, although based on the n=7 cancer cases, there was limited power to detect other than strong associations. Spearman correlations of PC scores 1-4 with the Gleason score ranged from −0.49 to +0.26, with P=0.2 to 0.8.

[0163] F. Logistic Regression Models of Probability: The sigmoid curves (FIG. 3) for the prostate show sharp transitions between the normal and cancer states and normal and BPH states. These transitions are characterized by a lack of cases at intermediate probabilities, corresponding to the clear separation of groups in FIG. 1. Thus, at some point in the modification of DNA, critical structural changes apparently take place that lead to a rapid increase in cancer probability.

[0164] BPH is not known to be etiologically related to prostate cancer; however, it is of interest that the BPH versus prostate cancer curve (FIG. 3C) shows several cases having intermediate probabilities. The configuration of cases in FIG. 1 also provides some insight into the controversial view that BPH is a direct precursor of prostate cancer. The findings do not support this concept in that the BPH group lies “beyond” the cancer group, starting from the normal group. This positioning suggests that a transition from BPH to cancer would involve a reversal of some of the spectral transitions shown to be associated with cancer, or that there are additional changes in the BPH DNA that mimic a reversal in the progression to cancer. Alternatively, modifications may result in DNA structures that lead to a variety of nonneoplastic lesions, including BPH. Although BPH may not be a direct precursor of prostate cancer, PCA/FT-IR spectral analysis may provide a promising means of predicting the occurrence of prostate cancer, based on the structural status of BPH DNA.

[0165] The absence of transition states in the normal to cancer and normal to BPH curves is of interest. This is likely due to the fact that “transition” tissues having DNA values between zero and 100% probability (FIG. 3, A-C) were not part of this study.

[0166] Evidence with the prostate suggests that DNA structure is progressively altered in response to factors in the microenvironment, notably the hydroxyl radical (·OH), that are likely etiologically related to the development of cellular lesions, prostate tumors (adenocarcinoma) and BPH. Intervention to forestall or correct the genetic instability of these tissues and likely increase in cancer risk should focus on controlling the cellular redox status and ·OH concentrations. The approaches may include control of the iron-catalyzed conversion of H₂O₂ to ·OH (Imlay et al., Science 240:640-642, 1988); regulation of ·OH production resulting from redox cycling of hormones (Han and Liehr, Carcinogenesis 16:2571-2574, 1995) and environmental xenobiotics (Bagchi et al., Toxicology 104:129-140, 1995); and antioxidant/reductant therapy (Ames et al., Proc. Natl. Acad. Sci. USA 90:7915-7922, 1993; Bast et al., Am. J. Med. 91(Suppl. 3C):2S-13S, 1991).

Example 2 Breast Cancer

[0167] A. Tissue Acquisition, DNA Isolation and PCA/FT-IR Spectral Analysis: Tissues were obtained from local Seattle hospitals and The Cooperative Human Tissue Network (Cleveland, Ohio). A total of 12 tissues were obtained from 12 patients with invasive ductal carcinoma of the breast but having no lymph node involvement (IDC), of which one was multifocal (the second focus being a signet ring cell carcinoma, which was not evaluated) and one was bilateral breast cancer (only one of which was evaluated). A total of 25 tissues were obtained from 25 patients with invasive ductal carcinoma having one or more lymph nodes positive for metastatic cancer (IDC_(m)). No unusual histologies occurred among the non-metastatic and metastatic groups with the exception of the two IDCs mentioned. Tumor size was based on the maximum dimension of the tumor, as recorded in the pathology reports. Non-cancerous breast tissue (RMT) was obtained from 21 patients who had undergone hypermastia surgery (reduction mammoplasty). Routine pathology showed no cellular changes other than occasional non-neoplastic (e.g., fibrocystic) lesions in these tissues.

[0168] After excision, each tissue was flash frozen in liquid nitrogen and stored at −80° C. DNA was isolated from the tissues, dissolved in deionized water, and aliquoted for FT-IR spectroscopy (˜20 μg). Each DNA sample was completely dried by lyophilization, purged with pure nitrogen, and stored in an evacuated, sealed glass vial at −80° C. All samples were analyzed by FT-IR spectroscopy.

[0169] The IR spectra were obtained using the Perkin-Elmer System 2000 equipped with an I-series microscope (The Perkin-Elmer Corp., Norwalk, Conn.). Each spectrum was specified by the absorbance at each integer wavenumber from 2000 to 700 cm⁻¹. Only the interval from 1750 to 700 cm⁻¹, which included all major variations among spectra, was included in this analysis. A baseline adjustment and normalization were carried out. One RMT was represented by two sections. The mean of the two adjusted and normalized spectra was used in these analyses. The multiplicative normalizing factor was applied to absorbances between 1750 and 700 cm⁻¹. Using deuterium exchange, no evidence was found to suggest that absorbed moisture contributed to the spectral properties of DNA.

[0170] B. Statistical Analysis: For analysis of overall DNA structure employing FT-IR analysis, Principal Components Analysis (PCA) was used. PCA methodology is a statistical procedure applied to a single set of variables with the aim of discovering a few variables (components) that are independent of each other and which capture most of the information in the original, long list of variables. The methodology can greatly reduce the number of variables of concern. The PCs partition the total variance by finding the first PC (a linear combination of the variables) which accounts for the maximum amount of variance for the entire population. The PCA methodology then finds a second combination, independent of the first PC, such that it accounts for the next largest amount of variance. This procedure continues until a number of independent PCs are found that explain a significant portion of the total variance. In the present context, PCA was a way to identify major features of absorbance-wavenumber variation across a collection of spectra and describe that variation succinctly.

[0171] Using PCA, it is possible to identify a few components that serve as “building blocks” for the spectra. After the PCA, each spectrum can be represented by a few PC scores. PCA was carried out with the grand mean spectrum subtracted from individual spectra. Prior to the analysis, it was decided to retain enough components to explain at least 90% of the total variation (around the mean) of the data set. To determine if some of the differences among spectra might be due to age, the correlation between age and each PC score was calculated. To visualize the spectral relationship of the cancer and non-cancer groups (IDC_(m), IDC and RMT), plots were constructed based on their first three PC scores. These two and three dimensional plots permit the simultaneous examination of two or three of the most significant components of any single specimen data set and permit the meaningful comparison of each data set to one another.

[0172] C. Principal Components Analysis of Spectral Profiles: Spectral profiles revealed great diversity of the IDC_(m) group and homogeneity of the IDC group. FIG. 4 shows a three-dimensional representation of the spectra based on PCA. The position in this plot is determined by the absorbance spectrum, mainly expressed as the height, width and location of peaks. There is a core cluster of IDCs in the upper part of the plot (indicated by yellow spheres). The two IDCs in the lower left part of the plot are outliers well removed from the core cluster. Notably, these are: 1) an IDC with a second focus of signet ring cell carcinoma and 2) a bilateral breast cancer. As apparent from the plot, both the IDC_(m) cluster (magenta) and the RMT cluster (blue) are considerably larger (indicating greater spectral diversity) than the core IDC cluster.

[0173] The size of a cluster can be measured and its spectral diversity represented by the mean distance of the members from the centroid of the cluster. This distance can be expressed as an approximate percent difference in normalized absorbance per wavenumber between a cluster member and the mean spectrum for the cluster, which lies at its centroid. The distance expressed as a percent difference is calculated as: a) 100% times the square root of the mean squared difference in normalized absorbance across wavenumbers 1750 to 700 cm⁻¹, which is then b) divided by 1.0, the approximate mean normalized absorbance for most spectra. For the comparison of cluster sizes, three RMTs, three IDC_(m)s and two IDCs that lay at outlier distances from the centroid in each group were removed to define a core cluster for the RMT, IDC_(m), and IDC. All outliers had at least a 20% difference from any member of their cluster. Based on centroids and distances of the remaining cases, the spectral diversity (mean distance from the centroid) was 12.4% for the IDC_(m) group, 7.3% for the IDC group, and 9.2% for the RMT group. An approximate P-value for the difference in diversity between groups was based on the Mann-Whitney test, comparing distances to the centroids without outliers: P=0.003 for IDC vs. IDC_(m), P=0.04 for RMT vs. IDC_(m) and P=0.4 for RMT vs. IDC. (The P-values are approximate because dependence among distances is introduced through the calculation of the common centroid.)

[0174] Based on initial PCA of the 58 samples (RMT, N=21; IDC_(m), N=25; IDC, N=12), four outliers were detected, i.e., specimens whose FT-IR spectra departed strikingly from the rest of the group and which had outlier PC scores. The PCA was then repeated with these four outliers excluded. The PC scores were then calculated for these outliers in a manner similar to the others (subtracting the grand mean spectrum of the 54 samples and then projecting each of the residual spectra on the PC eigenvectors). It was found that 91% of the variation in absorbance of the 54 samples was explained by the first five components. This implies that variation among spectra is highly structured. The 1051 wavenumbers from 1750 to 700 cm⁻¹ constitute potentially 1051 dimensions of variation. Over 90% of this variation can be represented by only 5 dimensions.
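The sequence described above (subtract the grand mean, find the components, then project a held-out spectrum onto the eigenvectors) can be sketched in plain Python. This is an illustration only: it uses power iteration with deflation, which is one of several ways to obtain principal components and is not stated in the source, and the function names and three-point "spectra" are hypothetical. For real 1051-point spectra a linear algebra library would be the practical choice.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grand_mean(spectra):
    return [sum(col) / len(spectra) for col in zip(*spectra)]

def principal_components(spectra, n_components, iters=200):
    """Return (mean, eigenvectors, fraction of variation explained by each)."""
    mean = grand_mean(spectra)
    residual = [[x - m for x, m in zip(s, mean)] for s in spectra]
    total = sum(dot(r, r) for r in residual)  # assumed non-zero
    d = len(mean)
    components, explained = [], []
    for _ in range(n_components):
        v = [1.0 / (j + 1) for j in range(d)]  # arbitrary starting vector
        for _ in range(iters):
            scores = [dot(r, v) for r in residual]
            w = [sum(s * r[j] for s, r in zip(scores, residual)) for j in range(d)]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        scores = [dot(r, v) for r in residual]
        components.append(v)
        explained.append(sum(s * s for s in scores) / total)
        # Deflate: remove the variation captured by this component.
        residual = [[x - s * vj for x, vj in zip(r, v)]
                    for r, s in zip(residual, scores)]
    return mean, components, explained

def pc_scores(spectrum, mean, components):
    # Subtract the grand mean, then project on each eigenvector.
    resid = [x - m for x, m in zip(spectrum, mean)]
    return [dot(resid, v) for v in components]

# Made-up three-point "spectra"; the held-out one plays the role of an outlier.
inliers = [[1, 1, 0.0], [2, 2, 0.1], [3, 3, -0.1], [4, 4, 0.0], [5, 5, 0.0]]
mean, components, explained = principal_components(inliers, 2)
held_out_scores = pc_scores([6, 6, 0.0], mean, components)
```

In this toy data nearly all of the variation lies along a single direction, so the first component alone explains more than 99% of it; the spectra in the Examples required four or five components to pass 90%.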

[0175] There were only weak correlations of PC scores with age, but some correlations were statistically significant for all samples combined. Correlations between age and PC scores were as follows: r=0.21 for component 1 (P=0.1), r=0.29 for component 2 (P=0.003), r=0.03 for component 3 (P=0.8), r=0.25 for component 4 (P=0.06) and r=0.30 for component 5 (P=0.02). The small magnitude of these correlations suggests very little influence of age on spectral structure. Further, even the statistically significant correlations (PC-2 and PC-5) appear to be an artifact because correlations between the PC scores and age in the cancer and non-cancer groups separately are very weak (less than 0.18 in magnitude) and are non-significant (minimum P=0.4). There is a broad range of ages for all groups which should allow a substantial true correlation to be detected: 17 to 89 for all samples, 26 to 89 for cancer (IDC_(m) and IDC) and 17 to 63 for RMT. There was also no statistically significant correlation of the PC scores with the number or percent of positive lymph nodes.

[0176] FIG. 5A depicts the overlaid spectra of the two “outliers” (“A” and “B” in FIG. 4) that lie close together on the three-dimensional PCA plot shown in FIG. 5B. The actual spectra differ by only a mean of 3% in normalized absorbance, indicating high precision in characterizing spectral phenotypes. The two IDC outliers mentioned earlier are also distinct in spectral profile from the core IDC cluster. FIGS. 6A and 6B show these two spectra superimposed on the mean normalized spectrum of the IDC core cluster. Differences are notable over most of the spectral area, but especially in the following regions: 1700 to 1350 cm⁻¹, the peak at about 1240 cm⁻¹, and about 1180 to 900 cm⁻¹. These regions generally represent N—H and C—O vibrations of the bases, PO₂ anti-symmetric stretching vibrations of phosphodiester groups, and C—O vibrations of deoxyribose, respectively.

[0177] It was described above that the centroid for a related data set (e.g., IDC specimens) could be calculated wherein the centroid would be considered the weighted mean for the spectra associated with a particular species of specimen. Such an activity is shown in FIG. 7 for PC1 and PC2 values for the three types of specimens subject to analysis. In this figure, the vector from the centroid for RMT specimens to the centroid for the IDC specimens is shown on the left hand side of the graph and represents the shift of spectral profiles from a RMT to an IDC state. This direction constitutes an initial direction and establishes a reference for comparison to the vector derived from the IDC centroid to the IDC_(m) centroid. The degree of vector rotation, relative to the RMT-IDC vector, is shown in Table 3.

TABLE 3

Spectral Region    Change in     95% Confidence
(cm⁻¹)             Direction     Interval for Change    P Value
1750-700              94°            66-129°            <0.001
1750-1550             86°            52-127°            <0.001
1549-1300            127°            93-154°            <0.001
1299-1200            113°            77-164°            <0.001
1199-850             108°            65-146°            <0.001
849-700               83°            28-148°            <0.001
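The change in direction reported in Table 3 is the angle between the RMT-to-IDC centroid vector and the IDC-to-IDC_(m) centroid vector. A minimal Python sketch of that geometric calculation is given below; the function names and the two-dimensional example points are hypothetical, unweighted means are assumed for the centroids, and the confidence intervals and P-values in Table 3 are not reproduced because the source does not state how they were computed.

```python
import math

def centroid(points):
    # Unweighted mean of a group of spectra (or of their PC scores).
    return [sum(c) / len(points) for c in zip(*points)]

def angle_between(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cosine))

def direction_change(rmt, idc, idc_m):
    c_rmt, c_idc, c_idcm = centroid(rmt), centroid(idc), centroid(idc_m)
    first = [b - a for a, b in zip(c_rmt, c_idc)]    # RMT -> IDC
    second = [b - a for a, b in zip(c_idc, c_idcm)]  # IDC -> IDC_m
    return angle_between(first, second)

# Made-up PC1/PC2 coordinates for three small groups.
rmt = [[0.0, 0.0], [0.0, 2.0]]
idc = [[2.0, 1.0], [4.0, 1.0]]
idc_m = [[3.0, 4.0], [3.0, 6.0]]
turn = direction_change(rmt, idc, idc_m)  # a right-angle turn in this toy case
```

An angle near 0° would mean that the shift toward IDC_(m) simply continues the shift from RMT to IDC, whereas the angles near or above 90° in Table 3 indicate a marked change in direction.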

[0178] It therefore can be seen that the effect on DNA from the IDC state to the IDC_(m) state is not only widespread over the analyzed spectrum, but relatively consistent. Moreover, the implication of this directional change lends support to the proposition that as attacks continue on DNA, there is a definite, quantifiable, and predictable movement of the DNA spectral profile from one state to another.

[0179] FIGS. 8 and 9 are presented to emphasize the spectral distinctiveness between the three species of specimens. In FIG. 8, the spectrum for the centroid of each species is shown. After the grand mean has been subtracted from these curves, the mean deviations for each species make the spectral features that distinguish the species readily discernible, as is best shown in FIG. 9.

[0180] In FIG. 10, a generally sigmoid curve is established using data sets generated by FT-IR. The transition from non-cancer to cancer is sharp, indicating that the manifestation of cancer can ultimately be initiated by a relatively small incentive, depending upon the “location” of the sample on the curve.

[0181] D. Alternative Means for Tissue Acquisition and Long-Term Storage: As an alternative means to the above described method for obtaining and preserving specimens for FT-IR analysis, it may be desirable to embed the specimen in a paraffin block after acquisition and initial preparation. When analysis of the specimen is desired, the paraffin-embedded tissue (PET) is dewaxed and the DNA is isolated by using conventional techniques such as application of phenol and/or chloroform solutions. After determining the purity of the specimen, the DNA is placed in an aqueous solution, dried under vacuum, and applied to the barium fluoride window for analysis by FT-IR.

[0182] The use of PET specimens for spectral analysis greatly increases the number of samples available for DNA analysis, since it is not necessary to wait and obtain special biopsies for analysis (specimens could be easily stored and retrieved at a later time), and permits retrospective follow-up studies of the same tissue specimens to be conducted rapidly and economically.

Example 3 Liver Cancer

[0183] A. Material and Methods: English sole were obtained from a relatively clean rural environment [Quartermaster Harbor, Wash.] and a chemically contaminated urban environment [Duwamish River, Seattle, Wash.]. Their livers were examined histologically and found to be cancer-free, although they contained various non-neoplastic lesions characteristic of fish from contaminated environments.

[0184] The Duwamish River flows into Puget Sound through a heavily industrialized area. The sediments contain a variety of carcinogens and other xenobiotics, such as polynuclear aromatic hydrocarbons and chlorinated pesticide residues; however, a restoration program is in progress to reduce the sediment contamination.

[0185] Two groups of sole were obtained from the Duwamish River (DUW93, n=8; and DUW95, n=10). Because of the restoration program, the DUW95 samples were expected to reflect significantly less sediment contamination than the DUW93 samples, but greater than the QMH samples. Fish from Quartermaster Harbor, Wash., served as controls (QMH, n=7). The lengths ± SD of the QMH, DUW95 and DUW93 fish were 29.5±4.2 cm, 23.6±1.6 cm and 24.1±0.8 cm, respectively. The weights were 254.3±115.0 g, 125.6±16.2 g and 125.0±22.5 g, respectively.

[0186] Isolation of DNA from hepatic tissue and PCA analyses of FT-IR spectra were undertaken as described above. Each FT-IR spectrum was normalized over the range 1750 to 700 cm⁻¹. PCA was used to identify a few variables (components) that capture most of the information in the original, long list of variables (the spectral absorbances at each of the 1051 wavenumbers from 1750 to 700 cm⁻¹). PC scores were calculated with the grand mean of all spectra subtracted from each spectrum. Thus, the PC scores represent variations in spectral (structural) features as they differ from the grand mean spectrum. The Kruskal-Wallis (KW) test and the Mann-Whitney (MW) test were used to calculate the statistical significance of differences in PC scores between groups. The same procedures were used to test for differences in spectral diversity, which was defined for a group as the mean distance of spectra to the group centroid. The unequal variance t-test was used to compare the mean normalized absorbance between groups. The t-test was carried out at each of the 1051 wavenumbers from 1750 to 700 cm⁻¹. Fish age, reflected in length and mass, was a potentially confounding variable, and this possibility was addressed in the analysis.
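The grand-mean subtraction and PC scoring described above can be illustrated with a small sketch. It is not the software used in this Example: it computes only the first principal component, and it uses power iteration as a stand-in for a full PCA routine. The function names and the three-point "spectra" are hypothetical, chosen so the result can be checked by hand.

```python
import math

def center_on_grand_mean(spectra):
    """Subtract the grand mean spectrum (mean over all samples) from each spectrum."""
    n, m = len(spectra), len(spectra[0])
    grand_mean = [sum(s[j] for s in spectra) / n for j in range(m)]
    return [[s[j] - grand_mean[j] for j in range(m)] for s in spectra]

def first_component(centered, iterations=200):
    """Leading principal axis (unit vector) by power iteration on X^T X."""
    m = len(centered[0])
    start = [float(j + 1) for j in range(m)]          # arbitrary non-zero start
    norm = math.sqrt(sum(x * x for x in start))
    axis = [x / norm for x in start]
    for _ in range(iterations):
        scores = [sum(r[j] * axis[j] for j in range(m)) for r in centered]
        new = [sum(sc * r[j] for sc, r in zip(scores, centered)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:                               # start was orthogonal to the data
            return axis
        axis = [x / norm for x in new]
    return axis

def pc_scores(centered, axis):
    """Project each centered spectrum onto the axis to obtain its PC score."""
    return [sum(r[j] * axis[j] for j in range(len(axis))) for r in centered]

# Three made-up "spectra" with three absorbance values each (illustration only).
spectra = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0]]
centered = center_on_grand_mean(spectra)
axis = first_component(centered)
scores = pc_scores(centered, axis)
print(axis, scores)
```

Because the scores are computed after centering, they sum to zero across samples, which is the sense in which PC scores measure departure from the grand mean spectrum.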

[0187] B. Results: FIG. 11 shows a PCA for the first three PC scores using specimens obtained from a location known not to be polluted (blue spheres); specimens obtained from an area known to be polluted (yellow spheres); and specimens obtained from the same polluted area prior to significant clean-up and/or environmental actions to remove polluted sediment (maroon spheres). As can be seen through inspection of the figure, a distribution similar to that encountered with breast tissue is present in the DNA of fish liver.

[0188] The clusters of points derived from the first three PC scores, which summarize spectral features of the DNA from the QMH and DUW groups, are shown in a three-dimensional projection (FIG. 11). The hypothesis that all groups have the same mean values of PC scores (thus, similar spectra) is rejected (KW P-value <0.001), and the hypothesis that any two of the groups have the same mean values of PC scores is also rejected (MW P-value 0.04 to <0.001). The three groups are distinct without any overlap (FIG. 11). PC1 and PC2, combined, account for 94% of the spectral variation and thus provide a good means for representing the variety of spectra encountered. PC3 is used for display purposes (FIG. 11), although it explains only 3% of the spectral variation.

[0189] The differences between groups occur at many frequencies. The upper part of each panel in FIG. 12 shows the mean spectrum for each of two groups (QMH-DUW93, QMH-DUW95 and DUW95-DUW93). The bottom part of the panel shows P-values for each spectral comparison, one P-value per wavenumber. The comparisons yield P<0.05 at 78-87% of the 1051 wavenumbers, thus demonstrating that the structures of the DNAs from the DUW93 and DUW95 groups are markedly different from each other and from the QMH group. Accordingly, the findings substantially invalidate the null hypothesis that the mean, normalized spectra are equal between groups. The spectral differences are notable with respect to the antisymmetric stretching vibrations of the PO₂ structure (≈1240 cm⁻¹). The band at this spectral region is present in the QMH group, but is virtually lost in the spectra of the DUW93 and DUW95 groups. Other major differences are evident in spectral regions representing vibrations associated with the nucleic acids (≈1700 to 1450 cm⁻¹) and deoxyribose (≈1150 to 950 cm⁻¹).

[0190] It is obvious (FIG. 11) that the samples can be 100% correctly classified into groups (separated) on the basis of the PC scores (Table 4).

TABLE 4
Principal component scores by group and statistical significance of differences between groups

Principal   QMH (n = 7)   DUW95 (n = 10)   DUW93 (n = 8)   KW P-value for        MW P-value      MW P-value      MW P-value
component   Mean ± SD     Mean ± SD        Mean ± SD       overall differences   QMH vs. DUW93   QMH vs. DUW95   DUW93 vs. DUW95
PC1         −6.1 ± 1.4    −12.8 ± 2.8      21.3 ± 12.3     <0.001                <0.001          <0.001          <0.001
PC2          6.1 ± 1.3     −3.3 ± 2.6      −1.3 ± 1.4      <0.001                <0.001          <0.001          0.04

[0191] FIG. 11 shows that the diversity of spectra (note the spread of points) is substantially greater in the DUW93 and DUW95 groups, compared to the QMH group. The varying diversity between the groups and the spectral differences which separate them are also evident in FIG. 13, in which the individual spectra of each group are overlaid. The tightness of the QMH spectra and the increasing spectral diversity from the QMH to the Duwamish River groups is notable in the region ≈1700 to 1450 cm⁻¹, which includes strong C-O stretching and NH₂ bending vibrations of the nucleic acids. Also in the DUW93 group, compared to the other groups, there is a pronounced increase in absorbance and spectral diversity in the 1400 cm⁻¹ region assigned to weak NH vibrations and CH in-plane deformations of the nucleic acids. The region ≈1150 to 950 cm⁻¹, which includes strong stretching vibrations associated with deoxyribose, increases in spectral diversity from QMH to DUW95, but tightens in the DUW93 group. The differences between the spectral properties are consistent with the discrimination between groups shown in Table 4 and the increased diversity of the clusters illustrated in FIG. 11.

[0192] A formal test for diversity differences (KW test for the null hypothesis that all groups have the same mean distance to the group centroid) yields P=0.002, strongly suggesting unequal diversity among groups. These mean distances to the centroid provide a scale for measuring diversity. A larger mean distance indicates that a group is more spread out (FIG. 11); that is, the spectra are more diverse. The DUW93 group has a mean distance which is four times that of the QMH group, representing a four-fold greater diversity (Table 5). Two of the three pairwise comparisons of diversity are significant (P<0.05); however, the comparison between the DUW95 and DUW93 groups is not significant (MW P-value=0.2), although the DUW93 group (representing DNA with the most altered base structure) is more diverse than the DUW95 group.

TABLE 5
Spectral diversity for three groups

         Distance to group centroid (diversity)
Group    Mean ± SD                                 N
QMH       2.5 ± 1.0                                7
DUW95     5.8 ± 2.0                                10
DUW93    10.2 ± 7.2                                8
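The diversity measure defined in this Example, the mean distance of a group's points to the group centroid, can be sketched as follows. The function name and the sample points are hypothetical, Euclidean distance in PC-score space is an assumption, and the Kruskal-Wallis and Mann-Whitney tests are not reproduced.

```python
import math

def mean_distance_to_centroid(points):
    """Spectral diversity of one group: mean Euclidean distance to its centroid."""
    n = len(points)
    dims = len(points[0])
    center = [sum(p[d] for p in points) / n for d in range(dims)]
    return sum(math.dist(p, center) for p in points) / n

# Made-up PC-score pairs for a tight group and a spread-out group.
tight_group = [(0.0, 0.0), (2.0, 0.0)]
spread_group = [(0.0, 0.0), (8.0, 0.0)]

tight = mean_distance_to_centroid(tight_group)
spread = mean_distance_to_centroid(spread_group)
print(f"diversity ratio: {spread / tight:.1f}")
```

Applied to the group means in Table 5, the same ratio is about 2.3 for DUW95 relative to QMH (5.8 / 2.5) and about 4.1 for DUW93 relative to QMH (10.2 / 2.5).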

[0193] The varying diversity of the groups is unlikely to be due to age variables. The QMH group is the most diverse in length and mass, yet it shows the least spectral diversity. The QMH group shows a length SD that is two to five times larger than that of the DUW95 and DUW93 groups and a mass SD that is five to seven times larger. However, the mean distance of the QMH spectra to their centroid is two to four-fold smaller than that of the Duwamish groups. These results would be highly inconsistent if age were a significant factor in spectral diversity. Length and mass also appear to have little effect in creating the spectral differences by location (FIG. 1). In regression analysis, length and mass combined explained only 7% of the variation in PC1 and 40% of the variation in PC2. PC1 is by far the more important component in explaining spectral diversity. Length and mass explain only about 9% of the overall spectral variation, whereas location explains 77%.

[0194] The DNA structures isolated from the QMH, DUW95 and DUW93 fish were each unique in that the PC plot revealed a complete separation of clusters (FIG. 11). In addition, the DNAs from the exposed groups were substantially more diverse than those of the control group, and the DUW93 group was more diverse than the DUW95 group (Table 5, FIG. 11). These distinctions, which were not significantly age-related, likely arose from structural features induced in DNA by different environmental factors. Among the environmental factors likely contributing to the cluster separations and the differences in diversity are the type, degree and duration of exposure to toxic chemicals in the sediments. Striking differences occurred between the three groups in regions of the spectra assigned to the nucleic acids and the phosphodiester-deoxyribose structure (FIGS. 12 and 13), suggesting that alterations in these structures contributed substantially to the separation of clusters and the differences in diversity among groups.

[0195] There was a statistically significant increase in the diversity of clusters representing the two Duwamish River groups, compared to the tight cluster of the reference group (FIG. 11; Table 5). Increased diversity may be especially important in carcinogenesis in that it sets the stage for the selection of DNA forms that give rise to malignant cellular phenotypes. The high degree of diversity in the exposed fish groups may serve the same function.

[0196] Cluster separation in PC plots was described above in studies of prostate (Example 1) and breast (Example 2) cancer. With the prostate, for example, perfect discrimination was achieved between DNA from normal and adenocarcinoma tissue. Similarly, perfect discrimination was obtained between clusters in this Example, thus demonstrating that the DNA structures had unique properties representing new forms of DNA. Considering that fish in the Duwamish River are prone to liver tumors, the distinctly different forms of DNA found in the DUW95 and DUW93 groups likely constitute critical stages in the progression to cancer.

[0197] This Example has shown that damage to the DNA of English sole exposed to environmental chemicals leads to new, diverse forms of DNA. These new forms may play a pivotal role in carcinogenesis and ultimately contribute to the development of liver cancer in the fish population. In addition, the results raise the question whether environmental chemicals play a role in generating the new forms of DNA found in breast and prostate cancers as described above.

[0198] All publications and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually incorporated by reference.

[0199] From the foregoing, it will be evident that, although specific embodiments of the invention have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the invention.

1. A method for defining the state of tissue comprising the steps: (a) subjecting DNA from a first plurality of tissue samples to Fourier transform-infrared (FT-IR) spectroscopy to produce FT-IR spectral data; (b) analyzing the FT-IR spectral data of step (a) by principal components analysis (PCA) to provide principal component (PC) scores; (c) applying cluster analysis to the PC scores of step (b) to distinguish outlier and non-outlier tissue samples; and (d) generating an equation, called a first equation, that defines a multivariate version of a normal bell-shaped curve which best fits the PC values from the non-outlier tissue samples, where the first equation defines the state of the first plurality of tissue samples.
 2. A method according to claim 1, further comprising repeating steps (a) through (d) with a second plurality of tissue samples, to provide a second equation, where the second equation defines the state of the second plurality of tissue samples.
 3. A method according to claim 2, further comprising applying multivariate discrimination analysis to the first and second equations, to provide first and second probability equations, respectively.
 4. A method according to claim 3, further comprising the steps: (e) subjecting a DNA sample from a tissue having a state of interest to FT-IR spectroscopy to produce FT-IR spectral data; (f) analyzing the FT-IR spectral data of step (e) by PCA to provide a set of PC scores; and (g) combining the PC scores of step (f) with each of the first and second probability equations to provide first and second probability scores, respectively.
 5. A method according to claim 1 wherein the tissue is breast, urogenital, liver, renal, pancreatic, lung, blood, brain or colorectal tissue.
 6. A method according to claim 1 wherein the tissue is cancerous tissue.
 7. A method according to claim 6 wherein the tissue is cancerous breast, prostate, ovarian or endometrial tissue.
 8. A method for assessing the genotoxicity of an environment comprising the steps of: (a) subjecting DNA from a plurality of first organisms in a first environment to Fourier transform-infrared (FT-IR) spectroscopy to produce FT-IR spectral data; (b) analyzing the FT-IR spectral data of step (a) by principal components analysis (PCA) to provide principal component (PC) scores; (c) applying cluster analysis to the PC scores of step (b) to distinguish outlier and non-outlier organisms; and (d) generating an equation, called a first equation, that defines a multivariate version of a normal bell-shaped curve which best fits the PC values from the non-outlier organisms, where the first equation defines the first organisms in the first environment.
 9. A method according to claim 8, further comprising repeating steps (a) through (d) with second organisms from a second environment, to provide a second equation, where the second equation defines the state of the second organisms in the second environment.
 10. A method according to claim 9, further comprising applying multivariate discrimination analysis to the first and second equations, to provide first and second probability equations, respectively.
 11. A method according to claim 10, further comprising the steps: (e) subjecting a DNA sample of an organism of interest from an environment of interest to FT-IR spectroscopy to produce FT-IR spectral data; (f) analyzing the FT-IR spectral data of step (e) by PCA to provide a set of PC scores; and (g) combining the PC scores of step (f) with each of the first and second probability equations to provide first and second probability scores, respectively.
 12. A method according to claim 9 wherein at least one of the first and second environments is a polluted environment.
 13. A method according to claim 9 wherein the first and second organisms are non-identical, however the first and second environments are identical.
 14. A method according to claim 9 wherein the first and second organisms are identical, however the first and second environments are non-identical.
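
For illustration only, the sequence recited in claims 1 through 4 (fitting a multivariate normal equation to each group's PC scores, then scoring a new sample against both) might be sketched as below. This is a sketch under stated assumptions, not the claimed procedure: it uses only two PC scores per sample, omits the cluster-analysis step that removes outliers, and assumes equal prior probabilities combined by Bayes' rule, none of which is specified in the claims. All names and data values are hypothetical.

```python
import math

def fit_bivariate_normal(points):
    """Estimate mean and covariance (sxx, sxy, syy) from (PC1, PC2) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def density(point, model):
    """Bivariate normal density of the fitted model at the given point."""
    (mx, my), (sxx, sxy, syy) = model
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mx, point[1] - my
    # Squared Mahalanobis distance using the closed-form 2x2 inverse.
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))

def probability_scores(point, model_a, model_b):
    """Probability of each group for the point, assuming equal priors."""
    da, db = density(point, model_a), density(point, model_b)
    return da / (da + db), db / (da + db)

# Made-up PC scores for two groups of samples (illustration only).
group_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
group_b = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)]
model_a = fit_bivariate_normal(group_a)
model_b = fit_bivariate_normal(group_b)

p_a, p_b = probability_scores((0.5, 0.5), model_a, model_b)
print(f"P(group A) = {p_a:.3f}, P(group B) = {p_b:.3f}")
```

A sample lying midway between the two fitted groups receives a probability score of 0.5 for each, which is the expected behavior when the two covariance estimates are equal.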